|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
f26080d4 |
| 03-Oct-2024 |
Jeffrey Ji <[email protected]> |
net_sched: sch_fq: add the ability to offload pacing
Some network devices have the ability to offload EDT (Earliest Departure Time) which is the model used for TCP pacing and FQ packet scheduler.
Some of them implement the timing wheel mechanism described in https://saeed.github.io/files/carousel-sigcomm17.pdf with an associated 'timing wheel horizon'.
This patch adds the TCA_FQ_OFFLOAD_HORIZON attribute to the FQ packet scheduler.
Its value is capped by the device max_pacing_offload_horizon, added in the prior patch.
It allows FQ to hand packets within the pacing offload horizon to the device, which handles the needed delay without host involvement.
Signed-off-by: Jeffrey Ji <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
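As a rough illustration of the mechanism described above, here is a minimal sketch of the enqueue-time decision, assuming a horizon value already clamped to the device's max_pacing_offload_horizon; the function and parameter names are illustrative, not the actual kernel code:
/* Sketch only: decide whether a paced packet can be handed to the NIC. */
static bool fq_pacing_offloadable(u64 skb_tstamp_ns, u64 now_ns,
				  u64 offload_horizon_ns)
{
	/* Deliveries due within the horizon are released immediately;
	 * the device's timing wheel applies the remaining delay. */
	return skb_tstamp_ns <= now_ns + offload_horizon_ns;
}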
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7 |
|
| #
33241dca |
| 23-Dec-2023 |
Jamal Hadi Salim <[email protected]> |
net/sched: Remove uapi support for CBQ qdisc
Commit 051d44209842 ("net/sched: Retire CBQ qdisc") retired the CBQ qdisc. Remove UAPI for it. Iproute2 will sync by equally removing it from user space.
Reviewed-by: Victor Nogueira <[email protected]> Reviewed-by: Pedro Tammela <[email protected]> Signed-off-by: Jamal Hadi Salim <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
| #
26cc8714 |
| 23-Dec-2023 |
Jamal Hadi Salim <[email protected]> |
net/sched: Remove uapi support for ATM qdisc
Commit fb38306ceb9e ("net/sched: Retire ATM qdisc") retired the ATM qdisc. Remove UAPI for it. Iproute2 will sync by equally removing it from user space.
Reviewed-by: Victor Nogueira <[email protected]> Reviewed-by: Pedro Tammela <[email protected]> Signed-off-by: Jamal Hadi Salim <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
| #
fe3b739a |
| 23-Dec-2023 |
Jamal Hadi Salim <[email protected]> |
net/sched: Remove uapi support for dsmark qdisc
Commit bbe77c14ee61 ("net/sched: Retire dsmark qdisc") retired the dsmark classifier. Remove UAPI support for it. Iproute2 will sync by equally removing it from user space.
Reviewed-by: Victor Nogueira <[email protected]> Reviewed-by: Pedro Tammela <[email protected]> Signed-off-by: Jamal Hadi Salim <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
|
Revision tags: v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5 |
|
| #
49e7265f |
| 02-Oct-2023 |
Eric Dumazet <[email protected]> |
net_sched: sch_fq: add TCA_FQ_WEIGHTS attribute
This attribute can be used to tune the per band weight and report them in "tc qdisc show" output:
qdisc fq 802f: parent 1:9 limit 100000p flow_limit 500p buckets 1024 orphan_mask 1023
 quantum 8364b initial_quantum 41820b low_rate_threshold 550Kbit
 refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 weights 589824 196608 65536
 Sent 236460814 bytes 792991 pkt (dropped 0, overlimits 0 requeues 0)
 rate 25816bit 10pps backlog 0b 0p requeues 0
  flows 4 (inactive 4 throttled 0)
  gc 0 throttled 19 latency 17.6us fastpath 773882
Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Dave Taht <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
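For illustration, a minimal sketch of the dump side, assuming the attribute payload is simply an array of one 32-bit weight per band (three bands); the helper name is illustrative:
/* Sketch: report the per-band weights as a raw array of three s32 values. */
static int fq_dump_weights_sketch(struct sk_buff *skb, const s32 weights[3])
{
	return nla_put(skb, TCA_FQ_WEIGHTS, 3 * sizeof(s32), weights);
}
The three values in the dump above, 589824:196608:65536, reduce to a 9:3:1 ratio between the bands.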
|
| #
29f834aa |
| 02-Oct-2023 |
Eric Dumazet <[email protected]> |
net_sched: sch_fq: add 3 bands and WRR scheduling
Before Google adopted FQ for its production servers, we had to ensure AF4 packets would get a higher share than BE1 ones.
As discussed this week in Netconf 2023 in Paris, it is time to upstream this for public use.
After this patch FQ can replace pfifo_fast, with the following differences :
- FQ uses WRR instead of strict prio, to avoid starvation of low priority packets.
- We make sure each band/prio tracks its own usage against sch->limit. This was done to make sure a flood of low priority packets would not prevent AF4 packets from being queued. Contributed by Willem.
- priomap can be changed, if needed (default values are the ones coming from pfifo_fast).
In this patch, we set default band weights so that :
- high prio (band=0) packets get 90% of the bandwidth if they compete with low prio (band=2) packets.
- high prio packets get 75% of the bandwidth if they compete with medium prio (band=1) packets.
Following patch in this series adds the possibility to tune the per-band weights.
As we added many fields in 'struct fq_sched_data', we had to make sure to have the first cache line read-mostly, and avoid wasting precious cache lines.
More optimizations are possible but will be sent separately.
Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Dave Taht <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
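The 90%/75% sharing targets pin down the default weight ratios; a small worked sketch (the array name is illustrative, the arithmetic follows directly from the text):
/* w0 / (w0 + w2) = 0.90  =>  w0 = 9 * w2
 * w0 / (w0 + w1) = 0.75  =>  w0 = 3 * w1
 * Picking w2 = 65536 gives w1 = 196608 and w0 = 589824, a 9:3:1 ratio. */
static const unsigned int fq_default_weights_sketch[3] = {
	589824,	/* band 0: high prio   */
	196608,	/* band 1: medium prio */
	65536,	/* band 2: low prio    */
};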
|
|
Revision tags: v6.6-rc4, v6.6-rc3 |
|
| #
076433bd |
| 20-Sep-2023 |
Eric Dumazet <[email protected]> |
net_sched: sch_fq: add fast path for mostly idle qdisc
TCQ_F_CAN_BYPASS can be used by a few qdiscs.
The idea is that if we queue a packet to an empty qdisc, the following dequeue() would pick it up immediately.
FQ can not use the generic TCQ_F_CAN_BYPASS code, because some additional checks need to be performed.
This patch adds a similar fast path to FQ.
Most of the time, the qdisc is not throttled, and many packets can avoid touching at least four cache lines and consuming 128 bytes of memory to store the state of a flow.
After this patch, netperf can send UDP packets about 13 % faster, and pktgen goes 30 % faster (when FQ is in the way), on a fast NIC.
TCP traffic is also improved, thanks to a reduction of cache line misses. I have measured a 5 % increase of throughput on a tcp_rr intensive workload.
tc -s -d qd sh dev eth1
...
qdisc fq 8004: parent 1:2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023
 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit
 refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 5646784384 bytes 1985161 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 122 (inactive 122 throttled 0)
  gc 0 highprio 0 fastpath 659990 throttled 27762 latency 8.57us
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
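As a hedged sketch of what such a fast path has to check before bypassing the flow lookup (all structure and field names below are illustrative):
struct fq_fastpath_inputs {
	unsigned int backlog_packets;	/* packets already queued      */
	unsigned int throttled_flows;	/* flows parked on the hrtimer */
	u64 pkt_tstamp_ns;		/* packet's earliest departure */
	u64 now_ns;
};

static bool fq_fastpath_ok_sketch(const struct fq_fastpath_inputs *in)
{
	/* Bypass only when nothing can delay or reorder the packet:
	 * empty qdisc, no throttled flows, packet not paced into the
	 * future.  The real code performs additional FQ-specific tests. */
	return in->backlog_packets == 0 &&
	       in->throttled_flows == 0 &&
	       in->pkt_tstamp_ns <= in->now_ns;
}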
|
|
Revision tags: v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7 |
|
| #
4072d97d |
| 15-Aug-2023 |
François Michel <[email protected]> |
netem: add prng attribute to netem_sched_data
Add a prng attribute to struct netem_sched_data and allow setting the seed of the PRNG through netlink using the new TCA_NETEM_PRNG_SEED attribute. The PRNG attribute is not actually used yet.
Signed-off-by: François Michel <[email protected]> Reviewed-by: Simon Horman <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
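A minimal sketch of consuming such a 64-bit seed attribute; the helper name and the fallback policy are assumptions:
/* Sketch: use TCA_NETEM_PRNG_SEED if given, otherwise pick a random seed. */
static void netem_get_seed_sketch(struct nlattr *tb[], u64 *seed)
{
	if (tb[TCA_NETEM_PRNG_SEED])
		*seed = nla_get_u64(tb[TCA_NETEM_PRNG_SEED]);
	else
		*seed = get_random_u64();
}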
|
|
Revision tags: v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5 |
|
| #
6c1adb65 |
| 30-May-2023 |
Vladimir Oltean <[email protected]> |
net/sched: taprio: add netlink reporting for offload statistics counters
Offloading drivers may report some additional statistics counters, some of them even suggested by 802.1Q, like TransmissionOverrun.
In my opinion we don't have to limit ourselves to reporting counters only globally to the Qdisc/interface, especially if the device has more detailed reporting (per traffic class), since the more detailed info is valuable for debugging and can help identify who is exceeding its time slot.
But on the other hand, some devices may not be able to report both per TC and global stats.
So we end up reporting both ways, and use the good old ethtool_put_stat() strategy to determine which statistics are supported by this NIC. Statistics which aren't set are simply not reported to netlink. For this reason, we need something dynamic (a nlattr nest) to be reported through TCA_STATS_APP, and not something daft like the fixed-size and inextensible struct tc_codel_xstats. A good model for xstats which are a nlattr nest rather than a fixed struct seems to be cake.
# Global stats
$ tc -s qdisc show dev eth0
# Per-tc stats
$ tc -s class show dev eth0
Signed-off-by: Vladimir Oltean <[email protected]> Acked-by: Vinicius Costa Gomes <[email protected]> Signed-off-by: David S. Miller <[email protected]>
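A hedged sketch of the "report only what the driver set" idea with a nlattr nest; every attribute type below is an illustrative placeholder, not the actual UAPI:
enum { XSTATS_SKETCH_UNSPEC, XSTATS_SKETCH_TX_OVERRUNS, XSTATS_SKETCH_PAD };

static int taprio_put_xstats_sketch(struct sk_buff *skb, int nest_type,
				    u64 tx_overruns)
{
	struct nlattr *nest = nla_nest_start(skb, nest_type);

	if (!nest)
		return -EMSGSIZE;
	/* A counter the driver left untouched (here: ~0) is simply not
	 * added to the nest, so unsupported statistics are not dumped. */
	if (tx_overruns != ~0ULL &&
	    nla_put_u64_64bit(skb, XSTATS_SKETCH_TX_OVERRUNS, tx_overruns,
			      XSTATS_SKETCH_PAD))
		goto nla_put_failure;
	nla_nest_end(skb, nest);
	return 0;

nla_put_failure:
	nla_nest_cancel(skb, nest);
	return -EMSGSIZE;
}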
|
|
Revision tags: v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7 |
|
| #
a721c3e5 |
| 11-Apr-2023 |
Vladimir Oltean <[email protected]> |
net/sched: taprio: allow per-TC user input of FP adminStatus
This is a duplication of the FP adminStatus logic introduced for tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload structure embedded within tc_taprio_qopt_offload. So practically, if a device driver is written to treat the mqprio portion of taprio just like standalone mqprio, it gets unified handling of frame preemption.
I would have reused more code with taprio, but this is mostly netlink attribute parsing, which is hard to transform into generic code without having something that stinks as a result. We have the same variables with the same semantics, just different nlattr type values (TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12; TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and consequently, different policies for the nest.
Every time nla_parse_nested() is called, an on-stack table "tb" of nlattr pointers is allocated statically, up to the maximum understood nlattr type. That array size is hardcoded as a constant, but when transforming this into a common parsing function, it would become either a VLA (which the Linux kernel rightfully doesn't like) or a call to the allocator.
Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE" and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE" use cases. HOLD and RELEASE events are emitted towards the underlying MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space, one can distinguish between the 2 use cases by choosing whether to use the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.
A small part of the change is dedicated to refactoring the max_sdu nlattr parsing to put all logic under the "if" that tests for presence of that nlattr.
Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Ferenc Fejes <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
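A hedged sketch of handling one TC entry nest once its sub-attributes have been parsed; the INDEX sub-attribute name and the TC_FP_PREEMPTIBLE value are assumptions used only for illustration:
/* Sketch: record the FP adminStatus of one traffic class as a bitmap bit. */
static void taprio_tc_entry_fp_sketch(struct nlattr *tb[],
				      unsigned long *preemptible_tcs)
{
	u32 tc, fp;

	if (!tb[TCA_TAPRIO_TC_ENTRY_INDEX] || !tb[TCA_TAPRIO_TC_ENTRY_FP])
		return;

	tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]);
	fp = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_FP]);

	if (fp == TC_FP_PREEMPTIBLE)
		__set_bit(tc, preemptible_tcs);
	else
		__clear_bit(tc, preemptible_tcs);
}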
|
| #
f62af20b |
| 11-Apr-2023 |
Vladimir Oltean <[email protected]> |
net/sched: mqprio: allow per-TC user input of FP adminStatus
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each packet priority can be assigned to a "frame preemption status" value of either "express" or "preemptible". Express priorities are transmitted by the local device through the eMAC, and preemptible priorities through the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge layer).
The FP adminStatus is defined per packet priority, but 802.1Q clause 12.30.1.1.1 framePreemptionAdminStatus also says that:
| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.
It is impossible to ignore the cognitive dissonance in the standard here, because it practically means that the FP adminStatus only takes distinct values per traffic class, even though it is defined per priority.
I can see no valid use case which is prevented by having the kernel take the FP adminStatus as input per traffic class (what we do here). In addition, this also enforces the above constraint by construction. User space network managers which wish to expose FP adminStatus per priority are free to do so; they must only observe the prio_tc_map of the netdev (which presumably is also under their control, when constructing the mqprio netlink attributes).
The reason for configuring frame preemption as a property of the Qdisc layer is that the information about "preemptible TCs" is closest to the place which handles the num_tc and prio_tc_map of the netdev. If the UAPI would have been any other layer, it would be unclear what to do with the FP information when num_tc collapses to 0. A key assumption is that only mqprio/taprio change the num_tc and prio_tc_map of the netdev. Not sure if that's a great assumption to make.
Having FP in tc-mqprio can be seen as an implementation of the use case defined in 802.1Q Annex S.2 "Preemption used in isolation". There will be a separate implementation of FP in tc-taprio, for the other use cases.
Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Ferenc Fejes <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0 |
|
| #
a54fc09e |
| 28-Sep-2022 |
Vladimir Oltean <[email protected]> |
net/sched: taprio: allow user input of per-tc max SDU
IEEE 802.1Q clause 12.29.1.1 "The queueMaxSDUTable structure and data types" and 8.6.8.4 "Enhancements for scheduled traffic" talk about the existence of a per traffic class limitation of maximum frame sizes, with a fallback on the port-based MTU.
As far as I am able to understand, the 802.1Q Service Data Unit (SDU) represents the MAC Service Data Unit (MSDU, i.e. L2 payload), excluding any number of prepended VLAN headers which may be otherwise present in the MSDU. Therefore, the queueMaxSDU is directly comparable to the device MTU (1500 means L2 payload sizes are accepted, or frame sizes of 1518 octets, or 1522 plus one VLAN header). Drivers which offload this are directly responsible for translating it into other units of measurement.
To keep the fast path checks optimized, we keep 2 arrays in the qdisc, one for max_sdu translated into frame length (so that it's comparable to skb->len), and another for offloading and for dumping back to the user.
Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
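A hedged sketch of the two-array bookkeeping described in the last paragraph; the structure and field names are illustrative:
struct taprio_max_sdu_sketch {
	u32 max_sdu[16];	/* as configured, for offload and dump */
	u32 max_frm_len[16];	/* precomputed, comparable to skb->len */
};

static void taprio_set_max_sdu_sketch(struct taprio_max_sdu_sketch *q,
				      const struct net_device *dev,
				      int tc, u32 max_sdu)
{
	q->max_sdu[tc] = max_sdu;
	q->max_frm_len[tc] = max_sdu + dev->hard_header_len;
	/* fast path: if (skb->len > q->max_frm_len[tc]) the packet is
	 * rejected for exceeding the per-TC queueMaxSDU */
}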
|
|
Revision tags: v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7 |
|
| #
dfcb63ce |
| 19-Oct-2021 |
Toke Høiland-Jørgensen <[email protected]> |
fq_codel: generalise ce_threshold marking for subset of traffic
Commit e72aeb9ee0e3 ("fq_codel: implement L4S style ce_threshold_ect1 marking") expanded the ce_threshold feature of FQ-CoDel so it can be applied to a subset of the traffic, using the ECT(1) bit of the ECN field as the classifier. However, hard-coding ECT(1) as the only classifier for this feature seems limiting, so let's expand it to be more general.
To this end, change the parameter from a ce_threshold_ect1 boolean, to a one-byte selector/mask pair (ce_threshold_{selector,mask}) which is applied to the whole diffserv/ECN field in the IP header. This makes it possible to classify packets by any value in either the ECN field or the diffserv field. In particular, setting a selector of INET_ECN_ECT_1 and a mask of INET_ECN_MASK corresponds to the functionality before this patch, and a mask of ~INET_ECN_MASK allows using the selector as a straight-forward match against a diffserv code point:
# apply ce_threshold to ECT(1) traffic
tc qdisc replace dev eth0 root fq_codel ce_threshold 1ms ce_threshold_selector 0x1/0x3

# apply ce_threshold to ECN-capable traffic marked as diffserv AF22
tc qdisc replace dev eth0 root fq_codel ce_threshold 1ms ce_threshold_selector 0x50/0xfc
Regardless of the selector chosen, the normal rules for ECN-marking of packets still apply, i.e., the flow must still declare itself ECN-capable by setting one of the bits in the ECN field to get marked at all.
v2: - Add tc usage examples to patch description
Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
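A hedged sketch of the resulting match on the dsfield byte; ipv4_get_dsfield() is the existing accessor, the function name itself is illustrative:
/* Sketch: ce_threshold applies only when the whole diffserv/ECN byte
 * matches selector under mask; selector=INET_ECN_ECT_1 with
 * mask=INET_ECN_MASK reproduces the previous ECT(1)-only behaviour. */
static bool ce_threshold_applies_sketch(const struct iphdr *iph,
					u8 selector, u8 mask)
{
	return (ipv4_get_dsfield(iph) & mask) == selector;
}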
|
|
Revision tags: v5.15-rc6 |
|
| #
e72aeb9e |
| 14-Oct-2021 |
Eric Dumazet <[email protected]> |
fq_codel: implement L4S style ce_threshold_ect1 marking
Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency, Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
If enabled, only packets with ECT(1) can be transformed to CE if their sojourn time is above the ce_threshold.
Note that this new option does not change rules for codel law. In particular, if TCA_FQ_CODEL_ECN is left enabled (this is the default when fq_codel qdisc is created), ECT(0) packets can still get CE if codel law (as governed by limit/target) decides so.
Section 4.3.b of current draft [1] states:
b. A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can be used for L4S. For instance within each queue of an FQ-CoDel system, as well as a CoDel AQM, there is typically also ECN marking at an immediate (unsmoothed) shallow threshold to support use in data centres (see Sec.5.2.7 of [RFC8290]). This can be modified so that the shallow threshold is solely applied to ECT(1) packets. Then if there is a flow of non-ECN or ECT(0) packets in the per-flow-queue, the Classic AQM (e.g. CoDel) is applied; while if there is a flow of ECT(1) packets in the queue, the shallower (typically sub-millisecond) threshold is applied.
Tested:
tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec
netperf ... -t TCP_STREAM -- K dctcp
tc -s -d qd sh dev eth1
qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms
 ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
 backlog 0b 0p requeues 152013
 maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
 new_flows_len 0 old_flows_len 0
[1] L4S current draft: https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch
Signed-off-by: Eric Dumazet <[email protected]> Cc: Neal Cardwell <[email protected]> Cc: Ingemar Johansson S <[email protected]> Cc: Tom Henderson <[email protected]> Cc: Bob Briscoe <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
|
Revision tags: v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
c7c5e6ff |
| 03-Sep-2021 |
Eric Dumazet <[email protected]> |
fq_codel: reject silly quantum parameters
syzbot found that forcing a big quantum attribute would crash hosts fast, essentially using this:
tc qd replace dev eth0 root fq_codel quantum 4294967295
This is because fq_codel_dequeue() would have to loop ~2^31 times in :
if (flow->deficit <= 0) {
	flow->deficit += q->quantum;
	list_move_tail(&flow->flowchain, &q->old_flows);
	goto begin;
}
SFQ max quantum is 2^19 (half a megabyte). Let's adopt a max quantum of one megabyte for FQ_CODEL.
Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
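A hedged sketch of the resulting bound check; the exact error handling is an assumption, only the one-megabyte cap comes from the text:
/* Sketch: reject quantum values that would make the dequeue loop spin. */
static int fq_codel_check_quantum_sketch(struct nlattr *tb[], u32 *quantum)
{
	if (!tb[TCA_FQ_CODEL_QUANTUM])
		return 0;
	*quantum = nla_get_u32(tb[TCA_FQ_CODEL_QUANTUM]);
	if (*quantum == 0 || *quantum > 1 << 20)	/* cap at one megabyte */
		return -EINVAL;
	return 0;
}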
|
|
Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5 |
|
| #
d03b195b |
| 19-Jan-2021 |
Maxim Mikityanskiy <[email protected]> |
sch_htb: Hierarchical QoS hardware offload
HTB doesn't scale well because of contention on a single lock, and it also consumes CPU. This patch adds support for offloading HTB to hardware that supports hierarchical rate limiting.
In the offload mode, HTB passes control commands to the driver using ndo_setup_tc. The driver has to replicate the whole hierarchy of classes and their settings (rate, ceil) in the NIC. Every modification of the HTB tree caused by the admin results in ndo_setup_tc being called.
After this setup, the HTB algorithm is done completely in the NIC. An SQ (send queue) is created for every leaf class and attached to the hierarchy, so that the NIC can calculate and obey aggregated rate limits, too. In the future, it can be changed, so that multiple SQs will back a single leaf class.
ndo_select_queue is responsible for selecting the right queue that serves the traffic class of each packet.
The data path works as follows: a packet is classified by clsact, the driver selects a hardware queue according to its class, and the packet is enqueued into this queue's qdisc.
This solution addresses two main problems of scaling HTB:
1. Contention by flow classification. Currently the filters are attached to the HTB instance as follows:
# tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
It's possible to move classification to clsact egress hook, which is thread-safe and lock-free:
# tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
This way classification still happens in software, but the lock contention is eliminated, and it happens before selecting the TX queue, allowing the driver to translate the class to the corresponding hardware queue in ndo_select_queue.
Note that this is already compatible with non-offloaded HTB and doesn't require changes to the kernel nor iproute2.
2. Contention by handling packets. HTB is not multi-queue; it attaches to a whole net device, and handling of all packets takes the same lock. When HTB is offloaded, it registers itself as a multi-queue qdisc, similarly to mq: HTB is attached to the netdev, and each queue has its own qdisc.
Some features of HTB may not be supported by particular hardware: for example, the maximum number of classes may be limited, or the granularity of the rate and ceil parameters may differ. The offload is therefore not enabled by default; a new parameter is used to enable it:
# tc qdisc replace dev eth0 root handle 1: htb offload
Signed-off-by: Maxim Mikityanskiy <[email protected]> Reviewed-by: Tariq Toukan <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3 |
|
| #
aee9caa0 |
| 26-Jun-2020 |
Petr Machata <[email protected]> |
net: sched: sch_red: Add qevents "early_drop" and "mark"
In order to allow acting on dropped and/or ECN-marked packets, add two new qevents to the RED qdisc: "early_drop" and "mark". Filters attached at "early_drop" block are executed as packets are early-dropped, those attached at the "mark" block are executed as packets are ECN-marked.
Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark" qevent. Absence of these attributes signifies "don't care": no block is allocated in that case, or the existing blocks are left intact in case of the change callback.
For purposes of offloading, blocks attached to these qevents appear with newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and FLOW_BLOCK_BINDER_TYPE_RED_MARK.
Signed-off-by: Petr Machata <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
|
Revision tags: v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4 |
|
| #
39d01050 |
| 01-May-2020 |
Eric Dumazet <[email protected]> |
net_sched: sch_fq: add horizon attribute
QUIC servers would like to use SO_TXTIME, without having CAP_NET_ADMIN, to efficiently pace UDP packets.
As far as sch_fq is concerned, we need to add safety checks, so that a buggy application does not fill the qdisc with packets having delivery time far in the future.
This patch adds a configurable horizon (default: 10 seconds), and a configurable policy when a packet is beyond the horizon at enqueue() time:
- either drop the packet (default policy)
- or cap its delivery time to the horizon.
$ tc -s -d qd sh dev eth0
qdisc fq 8022: root refcnt 257 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023
 quantum 10Kb initial_quantum 51160b low_rate_threshold 550Kbit
 refill_delay 40.0ms timer_slack 10.000us horizon 10.000s
 Sent 1234215879 bytes 837099 pkt (dropped 21, overlimits 0 requeues 6)
 backlog 0b 0p requeues 6
  flows 1191 (inactive 1177 throttled 0)
  gc 0 highprio 0 throttled 692 latency 11.480us pkts_too_long 0 alloc_errors 0
  horizon_drops 21 horizon_caps 0
v2: fixed an overflow on 32bit kernels in fq_init(), reported by kbuild test robot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]> Cc: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
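A hedged sketch of the enqueue-time policy described above; names are illustrative:
/* Sketch: a delivery time beyond now + horizon is either dropped
 * (default) or capped to the horizon. */
static bool fq_horizon_sketch(u64 *tstamp_ns, u64 now_ns, u64 horizon_ns,
			      bool horizon_drop)
{
	if (*tstamp_ns <= now_ns + horizon_ns)
		return true;			/* accept as-is      */
	if (horizon_drop)
		return false;			/* drop the packet   */
	*tstamp_ns = now_ns + horizon_ns;	/* cap delivery time */
	return true;
}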
|
|
Revision tags: v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6 |
|
| #
673040c3 |
| 24-Mar-2020 |
Eugene Syromiatnikov <[email protected]> |
taprio: do not use BIT() in TCA_TAPRIO_ATTR_FLAG_* definitions
BIT() macro definition is internal to the Linux kernel and is not to be used in UAPI headers; replace its usage with the _BITUL() macro that is already used elsewhere in the header.
Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading") Signed-off-by: Eugene Syromiatnikov <[email protected]> Acked-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
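For reference, a sketch of what the change implies in the header, using one of the taprio flags as an example and assuming _BITUL() from linux/const.h, as the commit states:
/* Before: BIT() is kernel-internal and must not appear in UAPI headers.
 *	#define TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST	BIT(0)
 * After: */
#define TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST	_BITUL(0)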
|
|
Revision tags: v5.6-rc7 |
|
| #
583396f4 |
| 17-Mar-2020 |
Eric Dumazet <[email protected]> |
net_sched: sch_fq: enable use of hrtimer slack
Add a new attribute to control the fq qdisc hrtimer slack.
Default is set to 10 usec.
When/if packets are throttled, fq sets up an hrtimer that can lead to one interrupt per packet in the throttled queue.
By using a timer slack, we allow better use of timer interrupts, by giving them a chance to call multiple timer callbacks at each hardware interrupt.
Also, giving a slack allows FQ to dequeue batches of packets instead of a single one, thus increasing xmit_more efficiency.
This has no negative effect on the rate a TCP flow can sustain, since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns).
v2: added strict netlink checking (as feedback from Jakub Kicinski)
Tested: 1000 concurrent flows all using paced packets. 1,000,000 packets sent per second.
Before the patch :
$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 0 0 60726784 23628 3485992 0 0 138 1 977 535 0 12 87 0 0
 0 0 0 60714700 23628 3485628 0 0 0 0 1568827 26462 0 22 78 0 0
 1 0 0 60716012 23628 3485656 0 0 0 0 1570034 26216 0 22 78 0 0
 0 0 0 60722420 23628 3485492 0 0 0 0 1567230 26424 0 22 78 0 0
 0 0 0 60727484 23628 3485556 0 0 0 0 1568220 26200 0 22 78 0 0
 2 0 0 60718900 23628 3485380 0 0 0 40 1564721 26630 0 22 78 0 0
 2 0 0 60718096 23628 3485332 0 0 0 0 1562593 26432 0 22 78 0 0
 0 0 0 60719608 23628 3485064 0 0 0 0 1563806 26238 0 22 78 0 0
 1 0 0 60722876 23628 3485236 0 0 0 130 1565874 26566 0 22 78 0 0
 1 0 0 60722752 23628 3484908 0 0 0 0 1567646 26247 0 22 78 0 0
After the patch, slack of 10 usec, we can see a reduction of interrupts per second, and a small decrease of reported cpu usage.
$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 1 0 0 60722564 23628 3484728 0 0 133 1 696 545 0 13 87 0 0
 1 0 0 60722568 23628 3484824 0 0 0 0 977278 25469 0 20 80 0 0
 0 0 0 60716396 23628 3484764 0 0 0 0 979997 25326 0 20 80 0 0
 0 0 0 60713844 23628 3484960 0 0 0 0 981394 25249 0 20 80 0 0
 2 0 0 60720468 23628 3484916 0 0 0 0 982860 25062 0 20 80 0 0
 1 0 0 60721236 23628 3484856 0 0 0 0 982867 25100 0 20 80 0 0
 1 0 0 60722400 23628 3484456 0 0 0 8 982698 25303 0 20 80 0 0
 0 0 0 60715396 23628 3484428 0 0 0 0 981777 25176 0 20 80 0 0
 0 0 0 60716520 23628 3486544 0 0 0 36 978965 27857 0 21 79 0 0
 0 0 0 60719592 23628 3486516 0 0 0 22 977318 25106 0 20 80 0 0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
|
Revision tags: v5.6-rc6 |
|
| #
0a7fad23 |
| 12-Mar-2020 |
Petr Machata <[email protected]> |
net: sched: RED: Introduce an ECN nodrop mode
When the RED Qdisc is currently configured to enable ECN, the RED algorithm is used to decide whether a certain SKB should be marked. If that SKB is not ECN-capable, it is early-dropped.
It is also possible to keep all traffic in the queue, and just mark the ECN-capable subset of it, as appropriate under the RED algorithm. Some switches support this mode, and some installations make use of it.
To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is configured with this flag, non-ECT traffic is enqueued instead of being early-dropped.
Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
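A hedged sketch of the resulting decision for a packet that the RED algorithm selects for marking; INET_ECN_set_ce() is the existing marking helper, the rest is illustrative:
/* Sketch: with TC_RED_NODROP, non-ECT traffic is enqueued unmarked
 * instead of being early-dropped. */
static bool red_mark_or_enqueue_sketch(struct sk_buff *skb, bool use_ecn,
				       bool nodrop)
{
	if (use_ecn && INET_ECN_set_ce(skb))
		return true;	/* ECN-capable: mark and enqueue */
	if (nodrop)
		return true;	/* new mode: enqueue unmarked    */
	return false;		/* classic behaviour: early drop */
}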
|
| #
14bc175d |
| 12-Mar-2020 |
Petr Machata <[email protected]> |
net: sched: Allow extending set of supported RED flags
The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool of global RED flags. These are passed in tc_red_qopt.flags. However none of these qdiscs validate the flag field, and just copy it over wholesale to internal structures, and later dump it back. (An exception is GRED, which does validate for VQs -- however not for the main setup.)
A broken userspace can therefore configure a qdisc with arbitrary unsupported flags, and later expect to see the flags on qdisc dump. The current ABI therefore allows storage of several bits of custom data to qdisc instances of the types mentioned above. How many bits, depends on which flags are meaningful for the qdisc in question. E.g. SFQ recognizes flags ECN and HARDDROP, and the rest is not interpreted.
If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it, and at the same time it needs to retain the possibility to store 6 bits of uninterpreted data. Likewise RED, which adds a new flag later in this patchset.
To that end, this patch adds a new function, red_get_flags(), to split the passed flags of RED-like qdiscs to flags and user bits, and red_validate_flags() to validate the resulting configuration. It further adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags.
Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
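A hedged sketch of the described split; the mask of historically accepted flags is an assumption for illustration:
/* Sketch: separate the bits this qdisc interprets from the uninterpreted
 * user bits that are merely stored and dumped back. */
#define RED_HISTORIC_FLAGS_SKETCH \
	(TC_RED_ECN | TC_RED_HARDDROP | TC_RED_ADAPTATIVE)

static void red_get_flags_sketch(unsigned char qopt_flags,
				 unsigned char *flags,
				 unsigned char *userbits)
{
	*flags = qopt_flags & RED_HISTORIC_FLAGS_SKETCH;
	*userbits = qopt_flags & ~RED_HISTORIC_FLAGS_SKETCH;
}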
|
|
Revision tags: v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5 |
|
| #
ec97ecf1 |
| 22-Jan-2020 |
Mohit P. Tahiliani <[email protected]> |
net: sched: add Flow Queue PIE packet scheduler
Principles:
- Packets are classified on flows.
- This is a Stochastic model (as we use a hash, several flows might be hashed to the same slot)
- Each flow has a PIE managed queue.
- Flows are linked onto two (Round Robin) lists, so that new flows have priority on old ones.
- For a given flow, packets are not reordered.
- Drops during enqueue only.
- ECN capability is off by default.
- ECN threshold (if ECN is enabled) is at 10% by default.
- Uses timestamps to calculate queue delay by default.
Usage:
tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ] [ target TIME ] [ tupdate TIME ] [ alpha NUMBER ] [ beta NUMBER ] [ quantum BYTES ] [ memory_limit BYTES ] [ ecnprob PERCENTAGE ] [ [no]ecn ] [ [no]bytemode ] [ [no_]dq_rate_estimator ]
defaults:
  limit: 10240 packets, flows: 1024
  target: 15 ms, tupdate: 15 ms (in jiffies)
  alpha: 1/8, beta : 5/4
  quantum: device MTU, memory_limit: 32 Mb
  ecnprob: 10%, ecn: off
  bytemode: off, dq_rate_estimator: off
Signed-off-by: Mohit P. Tahiliani <[email protected]> Signed-off-by: Sachin D. Patil <[email protected]> Signed-off-by: V. Saicharan <[email protected]> Signed-off-by: Mohit Bhasi <[email protected]> Signed-off-by: Leslie Monis <[email protected]> Signed-off-by: Gautam Ramakrishnan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
|
Revision tags: v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3 |
|
| #
dcc68b4d |
| 18-Dec-2019 |
Petr Machata <[email protected]> |
net: sch_ets: Add a new Qdisc
Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is PRIO-like in how it is configured, meaning one needs to specify how many bands there are, how many are strict and how many are dwrr, quanta for the latter, and priomap.
The new Qdisc operates like the PRIO / DRR combo would when configured as per the standard. The strict classes, if any, are tried for traffic first. When there's no traffic in any of the strict queues, the ETS ones (if any) are treated in the same way as in DRR.
Signed-off-by: Petr Machata <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
|
Revision tags: v5.5-rc2, v5.5-rc1, v5.4 |
|
| #
cec2975f |
| 20-Nov-2019 |
Gautam Ramakrishnan <[email protected]> |
net: sched: pie: enable timestamp based delay calculation
RFC 8033 suggests an alternative approach to calculate the queue delay in PIE by using a timestamp on every enqueued packet. This patch adds an implementation of that approach and sets it as the default method to calculate queue delay. The previous method (based on Little's law) to calculate queue delay is set as optional.
Signed-off-by: Gautam Ramakrishnan <[email protected]> Signed-off-by: Leslie Monis <[email protected]> Signed-off-by: Mohit P. Tahiliani <[email protected]> Acked-by: Dave Taht <[email protected]> Signed-off-by: David S. Miller <[email protected]>
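A hedged sketch contrasting the two estimates; field and helper names are illustrative, the formulas follow the text (direct timestamp difference vs. Little's law from backlog and drain rate):
static u64 pie_qdelay_sketch(u64 now_ns, u64 enqueue_ns,
			     u64 backlog_bytes, u64 drain_rate_Bps,
			     bool dq_rate_estimator)
{
	if (!dq_rate_estimator)		/* default: per-packet timestamps */
		return now_ns - enqueue_ns;
	if (!drain_rate_Bps)
		return 0;
	return div64_u64(backlog_bytes * NSEC_PER_SEC,	/* Little's law */
			 drain_rate_Bps);
}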
|