Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14

# 1937a0be | 17-Mar-2025 | Eric Dumazet <[email protected]>

tcp: move icsk_clean_acked to a better location
As a followup of my presentation in Zagreb for netdev 0x19:
icsk_clean_acked is only used by TCP, from tcp_ack(), when/if CONFIG_TLS_DEVICE is enabled.
Rename it to tcp_clean_acked and move it to the tcp_sock structure, in the tcp_sock_read_rx group, for better cache locality in the TCP fast path.
Define this field only when CONFIG_TLS_DEVICE is enabled saving 8 bytes on configs not using it.
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Neal Cardwell <[email protected]>
Reviewed-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
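
To illustrate the saving, a minimal user-space sketch (the struct and field names are illustrative stand-ins, not the real tcp_sock layout) of how an #ifdef'd callback pointer costs 8 bytes only on configs that enable it:

    /* Toggle to 0 to mimic a kernel built without CONFIG_TLS_DEVICE. */
    #define CONFIG_TLS_DEVICE 1

    #include <stdio.h>

    struct tcp_sock_sketch {
    #if CONFIG_TLS_DEVICE
        /* the only user is TLS device offload, so compile it out otherwise */
        void (*tcp_clean_acked)(void *sk, unsigned int acked_seq);
    #endif
        unsigned int copied_seq;    /* stand-in for tcp_sock_read_rx fields */
        unsigned int rcv_wnd;
    };

    int main(void)
    {
        /* 16 bytes with the pointer, 8 without, on a 64-bit build */
        printf("sizeof = %zu\n", sizeof(struct tcp_sock_sketch));
        return 0;
    }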

Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5

# 3ba07527 | 25-Feb-2025 | Eric Dumazet <[email protected]>

tcp: be less liberal in TSEcr received while in SYN_RECV state
Yong-Hao Zou mentioned that Linux was not as strict as other OSes in the 3WHS, for flows using the TCP TS option (RFC 7323).
As hinted by an old comment in tcp_check_req(), we can check that the TSEcr value in the incoming packet corresponds to one of the SYNACK TSval values we have sent.
In this patch, I record the oldest and most recent values that SYNACK packets have used.
Send a challenge ACK if we receive a TSEcr outside of this range, and increase a new SNMP counter.
    nstat -az | grep TSEcrRejected
    TcpExtTSEcrRejected             0          0.0
Due to the TCP fastopen implementation, do not yet apply these checks to fastopen flows.
v2: No longer use req->num_timeout, but treq->snt_tsval_first, to detect when the first SYNACK is prepared. This means we make sure not to send an initial zero TSval. Make sure MPTCP and TCP selftests are passing. Change the MIB name to TcpExtTSEcrRejected.
v1: https://lore.kernel.org/netdev/CADVnQykD8i4ArpSZaPKaoNxLJ2if2ts9m4As+=Jvdkrgx1qMHw@mail.gmail.com/T/
Reported-by: Yong-Hao Zou <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Reviewed-by: Neal Cardwell <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
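
A user-space sketch (hypothetical helper names, not the kernel code) of the wraparound-safe form such a range check can take, in the spirit of the kernel's before()/after() serial-number comparisons; the actual patch responds to an out-of-range TSEcr with a challenge ACK:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* serial-number arithmetic, safe across 32-bit TSval wraparound */
    static bool tsval_before(uint32_t a, uint32_t b)
    {
        return (int32_t)(a - b) < 0;
    }

    /* accept only a TSEcr echoing one of the TSvals our SYNACKs carried */
    static bool tsecr_in_synack_range(uint32_t tsecr, uint32_t snt_tsval_first,
                                      uint32_t snt_tsval_last)
    {
        return !tsval_before(tsecr, snt_tsval_first) &&
               !tsval_before(snt_tsval_last, tsecr);
    }

    int main(void)
    {
        /* SYNACK TSvals 1000..1005 were sent */
        printf("%d\n", tsecr_in_synack_range(1003, 1000, 1005)); /* 1: accept */
        printf("%d\n", tsecr_in_synack_range(999, 1000, 1005));  /* 0: challenge ACK */
        return 0;
    }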

Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7

# 0a2cdeea | 04-Nov-2024 | Menglong Dong <[email protected]>

net: tcp: replace the document for "lsndtime" in tcp_sock
Commit d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables") moved the fields around and misplaced the documentation for "lsndtime". So, let's put it back in the proper place.
Signed-off-by: Menglong Dong <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3

# d2c3a7eb | 05-Apr-2024 | Eric Dumazet <[email protected]>

tcp: more struct tcp_sock adjustments
tp->recvmsg_inq is used from tcp_recvmsg(), thus should be in the tcp_sock_read_rx group.
tp->tcp_clock_cache and tp->tcp_mstamp are written in both rx and tx paths, thus are better placed in the tcp_sock_write_txrx group.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7

# 345a6e26 | 01-Mar-2024 | Eric Dumazet <[email protected]>

tcp: align tcp_sock_write_rx group
Stephen Rothwell and the kernel test robot reported that some arches (parisc, hexagon) and/or compilers would not like the blamed commit.
Let's make sure the tcp_sock_write_rx group does not start with a hole.
While we are at it, correct the tcp_sock_write_tx CACHELINE_ASSERT_GROUP_SIZE(), since after the blamed commit we went to 105 bytes.
Fixes: 99123622050f ("tcp: remove some holes in struct tcp_sock")
Reported-by: Stephen Rothwell <[email protected]>
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/netdev/[email protected]/
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Tested-by: Simon Horman <[email protected]> # build-tested
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
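
A user-space sketch of the kind of compile-time check CACHELINE_ASSERT_GROUP_SIZE() performs; the struct and the 16-byte figure here are invented for illustration:

    #include <assert.h>
    #include <stddef.h>

    struct demo_sock {
        long a;          /* start of the group */
        int  b;
        int  c;          /* end of the group */
        long unrelated;
    };

    /* offset just past member m, like the kernel's offsetofend() */
    #define offsetofend(T, m) (offsetof(T, m) + sizeof(((T *)0)->m))

    /* the group spans [a, c]: 8 + 4 + 4 = 16 bytes, no leading hole */
    static_assert(offsetofend(struct demo_sock, c) -
                  offsetof(struct demo_sock, a) == 16,
                  "group size changed; update the assert, as the commit does");

    int main(void) { return 0; }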

# 99123622 | 27-Feb-2024 | Eric Dumazet <[email protected]>

tcp: remove some holes in struct tcp_sock
By moving some fields around, this patch shrinks the sum of holes from 56 to 32 bytes, saving 24 bytes on 64-bit arches.
After the patch pahole gives the following for 'struct tcp_sock':
    /* size: 2304, cachelines: 36, members: 162 */
    /* sum members: 2234, holes: 6, sum holes: 32 */
    /* sum bitfield members: 34 bits, bit holes: 5, sum bit holes: 14 bits */
    /* padding: 32 */
    /* paddings: 3, sum paddings: 10 */
    /* forced alignments: 1, forced holes: 1, sum forced holes: 12 */
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
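
The effect is easy to reproduce in a user-space sketch (illustrative struct, not tcp_sock); pahole, or plain sizeof, shows the reordering reclaiming the padding holes:

    #include <stdio.h>

    struct holey {          /* 24 bytes on 64-bit: a 4-byte hole plus tail padding */
        int  a;
        long b;             /* needs 8-byte alignment: hole after a */
        int  c;             /* tail padding to the next 8-byte multiple */
    };

    struct reordered {      /* 16 bytes: the two ints now share what was a hole */
        long b;
        int  a;
        int  c;
    };

    int main(void)
    {
        printf("%zu -> %zu bytes\n",
               sizeof(struct holey), sizeof(struct reordered)); /* 24 -> 16 */
        return 0;
    }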

Revision tags: v6.8-rc6, v6.8-rc5, v6.8-rc4

# 666a877d | 08-Feb-2024 | Eric Dumazet <[email protected]>

tcp: move tp->tcp_usec_ts to tcp_sock_read_txrx group
tp->tcp_usec_ts is a read mostly field, used in rx and tx fast paths.
Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Cc: Wei Wang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

# 119ff048 | 08-Feb-2024 | Eric Dumazet <[email protected]>

tcp: move tp->scaling_ratio to tcp_sock_read_txrx group
tp->scaling_ratio is a read mostly field, used in rx and tx fast paths.
Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Cc: Wei Wang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5

# 9396c4ee | 04-Dec-2023 | Dmitry Safonov <[email protected]>

net/tcp: Don't store TCP-AO maclen on reqsk
This extra check doesn't work for a handshake when the SYN segment has (current_key.maclen != rnext_key.maclen). It could be amended to preserve rnext_key.maclen instead of current_key.maclen, but that requires a lookup on the listen socket.
Originally, this extra maclen check was introduced just because it was cheap. Drop it and convert tcp_request_sock::maclen into boolean tcp_request_sock::used_tcp_ao.
Fixes: 06b22ef29591 ("net/tcp: Wire TCP-AO to request sockets")
Signed-off-by: Dmitry Safonov <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

# d5fed5ad | 04-Dec-2023 | Coco Li <[email protected]>

tcp: reorganize tcp_sock fast path variables
The variables are organized in the following way:
- TX read-mostly hotpath cache lines
- TXRX read-mostly hotpath cache lines
- RX read-mostly hotpath cache lines
- TX read-write hotpath cache line
- TXRX read-write hotpath cache line
- RX read-write hotpath cache line
Fastpath cachelines end after rcvq_space.
Cache line boundaries are enforced only between read-mostly and read-write. That is, if read-mostly tx cachelines bleed into read-mostly txrx cachelines, we do not care. We care about the boundaries between read and write cachelines because we want to prevent false sharing.
Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 8
Suggested-by: Eric Dumazet <[email protected]>
Reviewed-by: Wei Wang <[email protected]>
Signed-off-by: Coco Li <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
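
A user-space sketch of the layout idea (field names invented, 64-byte cache lines assumed): read-mostly groups first, then the read-write groups, with an alignment boundary only where false sharing matters:

    #include <stddef.h>
    #include <stdio.h>

    struct fastpath_sock {
        /* tx/txrx/rx read-mostly groups: letting them bleed into each
         * other across a cache line boundary is acceptable */
        unsigned int  mss_cache;
        unsigned int  snd_wnd;
        unsigned int  rcv_mss;
        /* boundary enforced only here, between read-mostly and read-write,
         * so writers never dirty the read-mostly lines */
        unsigned long bytes_sent __attribute__((aligned(64)));
        unsigned int  snd_nxt;    /* tx read-write */
        unsigned int  rcv_nxt;    /* rx read-write */
    };

    int main(void)
    {
        printf("read-write groups start at offset %zu\n",
               offsetof(struct fastpath_sock, bytes_sent)); /* 64 */
        return 0;
    }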

Revision tags: v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1

# cdbab623 | 31-Oct-2023 | Eric Dumazet <[email protected]>

tcp: fix fastopen code vs usec TS
After the blamed commit, the TFO client-ack-dropped-then-recovery-ms-timestamps packetdrill test failed.
David Morley and Neal Cardwell started investigating, and Neal pointed out that we had:
    tcp_conn_request()
      tcp_try_fastopen()
       -> tcp_fastopen_create_child
        -> child = inet_csk(sk)->icsk_af_ops->syn_recv_sock()
         -> tcp_create_openreq_child()
          -> copy req_usec_ts from req:
             newtp->tcp_usec_ts = treq->req_usec_ts;
             // now the new TFO server socket always does usec TS, no matter
             // what the route options are...

    send_synack()
     -> tcp_make_synack()
        // disable tcp_rsk(req)->req_usec_ts if route option is not present:
        if (tcp_rsk(req)->req_usec_ts < 0)
                tcp_rsk(req)->req_usec_ts = dst_tcp_usec_ts(dst);
Since tcp_conn_request() has the initial dst, we can initialize tcp_rsk(req)->req_usec_ts there instead of later in send_synack().
This means tcp_rsk(req)->req_usec_ts can be a boolean.
Many thanks to David and Neal for their help.
Fixes: 614e8316aa4c ("tcp: add support for usec resolution in TCP TS values")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Suggested-by: Neal Cardwell <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Morley <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.6

# 06b22ef2 | 23-Oct-2023 | Dmitry Safonov <[email protected]>

net/tcp: Wire TCP-AO to request sockets
Now when the new request socket is created from the listening socket, it's recorded what MKT was used by the peer. tcp_rsk_used_ao() is a new helper for checking if TCP-AO option was used to create the request socket. tcp_ao_copy_all_matching() will copy all keys that match the peer on the request socket, as well as preparing them for the usage (creating traffic keys).
Co-developed-by: Francesco Ruggeri <[email protected]>
Signed-off-by: Francesco Ruggeri <[email protected]>
Co-developed-by: Salam Noureddine <[email protected]>
Signed-off-by: Salam Noureddine <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Acked-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

# decde258 | 23-Oct-2023 | Dmitry Safonov <[email protected]>

net/tcp: Add TCP-AO sign to twsk
Add support for sockets in time-wait state. ao_info, as well as all keys, is inherited on transition to a time-wait socket. The lifetime of ao_info is now protected by a ref counter, so that tcp_ao_destroy_sock() will destruct it only when the last user is gone.
Co-developed-by: Francesco Ruggeri <[email protected]>
Signed-off-by: Francesco Ruggeri <[email protected]>
Co-developed-by: Salam Noureddine <[email protected]>
Signed-off-by: Salam Noureddine <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Acked-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

# c845f5f3 | 23-Oct-2023 | Dmitry Safonov <[email protected]>

net/tcp: Add TCP-AO config and structures
Introduce a new kernel config option and common structures, as well as helpers to be used by TCP-AO code.
Co-developed-by: Francesco Ruggeri <[email protected]>
Signed-off-by: Francesco Ruggeri <[email protected]>
Co-developed-by: Salam Noureddine <[email protected]>
Signed-off-by: Salam Noureddine <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Acked-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.6-rc7

# 614e8316 | 20-Oct-2023 | Eric Dumazet <[email protected]>

tcp: add support for usec resolution in TCP TS values
Back in 2015, Van Jacobson suggested using usec resolution in TCP TS values. This has been implemented in our private kernels.
Goals were:
1) Better observability of delays in networking stacks.
2) Better disambiguation of events based on TSval/ecr values.
3) A building block for congestion control modules needing usec resolution.
Back then we implemented a scheme based on private SYN options to negotiate the feature.
For upstream submission, we chose to use a route attribute, because this feature is probably going to be used in private networks [1] [2].
ip route add 10/8 ... features tcp_usec_ts
Note that RFC 7323 recommends a "timestamp clock frequency in the range 1 ms to 1 sec per tick.", but also mentions "the maximum acceptable clock frequency is one tick every 59 ns."
[1] Unfortunately RFC 7323 5.5 (Outdated Timestamps) suggests to invalidate TS.Recent values after a flow was idle for more than 24 days. This is the part making usec_ts a problem for peers following this recommendation for long living idle flows.
[2] Attempts to standardize usec ts went nowhere:
https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf https://datatracker.ietf.org/doc/draft-wang-tcpm-low-latency-opt/
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
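
The trade-off behind [1] is plain arithmetic: a 32-bit TSval at 1 ms ticks wraps after about 49.7 days, while at 1 us ticks it wraps after about 71.6 minutes, which is why long-idle flows are the problematic case. A quick sketch:

    #include <stdio.h>

    int main(void)
    {
        /* 2^32 ticks until the 32-bit TSval wraps around */
        double ms_wrap_days = 4294967296.0 / 1000.0 / 86400.0;
        double us_wrap_min  = 4294967296.0 / 1000000.0 / 60.0;

        printf("1 ms ticks: TSval wraps after %.1f days\n", ms_wrap_days);
        printf("1 us ticks: TSval wraps after %.1f minutes\n", us_wrap_min);
        return 0;
    }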

# 3d44de9a | 20-Oct-2023 | Eric Dumazet <[email protected]>

tcp: add RTAX_FEATURE_TCP_USEC_TS
This new dst feature flag will be used to allow TCP to use usec-based timestamps instead of msec-based ones.
ip route .... feature tcp_usec_ts
Also document that RTAX_FEATURE_SACK and RTAX_FEATURE_TIMESTAMP are unused.
RTAX_FEATURE_ALLFRAG is also going away soon.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2

# 3868ab0f | 14-Sep-2023 | Aananth V <[email protected]>

tcp: new TCP_INFO stats for RTO events
The 2023 SIGCOMM paper "Improving Network Availability with Protective ReRoute" has indicated Linux TCP's RTO-triggered txhash rehashing can effectively reduce application disruption during outages. To better measure the efficacy of this feature, this patch adds three more detailed stats during RTO recovery and exports them via TCP_INFO. Applications and monitoring systems can leverage this data to measure the network path diversity and end-to-end repair latency during network outages, to improve their network infrastructure.
The following counters are added to tcp_sock in order to track RTO events over the lifetime of a TCP socket.
1. u16 total_rto - Counts the total number of RTO timeouts.
2. u16 total_rto_recoveries - Counts the total number of RTO recoveries.
3. u32 total_rto_time - Counts the total time spent (ms) in RTO recoveries (time spent in CA_Loss and CA_Recovery states).
To compute total_rto_time, we add a new u32 rto_stamp field to tcp_sock. rto_stamp records the start timestamp (ms) of the last RTO recovery (CA_Loss).
Corresponding fields are also added to the tcp_info struct.
Signed-off-by: Aananth V <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
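
A small sketch of how such counters can be maintained (same field names as the commit, but this is illustrative user-space code, not the kernel logic):

    #include <stdint.h>
    #include <stdio.h>

    struct rto_stats {
        uint16_t total_rto;             /* RTO timeouts */
        uint16_t total_rto_recoveries;  /* entries into RTO recovery */
        uint32_t total_rto_time;        /* ms spent in recovery */
        uint32_t rto_stamp;             /* ms, start of last recovery */
    };

    static void rto_recovery_start(struct rto_stats *s, uint32_t now_ms)
    {
        s->total_rto++;                 /* the timeout that opened recovery */
        s->total_rto_recoveries++;
        s->rto_stamp = now_ms;
    }

    static void rto_recovery_end(struct rto_stats *s, uint32_t now_ms)
    {
        if (s->rto_stamp) {
            s->total_rto_time += now_ms - s->rto_stamp;
            s->rto_stamp = 0;
        }
    }

    int main(void)
    {
        struct rto_stats s = {0};

        rto_recovery_start(&s, 1000);
        rto_recovery_end(&s, 1350);
        printf("recoveries=%u time=%ums\n",
               s.total_rto_recoveries, s.total_rto_time); /* 1, 350 ms */
        return 0;
    }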

# 133c4c0d | 11-Sep-2023 | Eric Dumazet <[email protected]>

tcp: defer regular ACK while processing socket backlog
This idea came after a particular workload requested the quickack attribute set on routes, and a performance drop was noticed for large bulk transfers.
For high throughput flows, it is best to use one cpu running the user thread issuing socket system calls, and a separate cpu to process incoming packets from BH context. (With TSO/GRO, bottleneck is usually the 'user' cpu)
Problem is the user thread can spend a lot of time while holding the socket lock, forcing BH handler to queue most of incoming packets in the socket backlog.
Whenever the user thread releases the socket lock, it must first process all accumulated packets in the backlog, potentially adding latency spikes. Due to flood mitigation, having too many packets in the backlog increases the chance of unexpected drops.
Backlog processing unfortunately shifts a fair amount of cpu cycles from the BH cpu to the 'user' cpu, thus reducing max throughput.
This patch takes advantage of the backlog processing, and of the fact that ACKs are mostly cumulative.
The idea is to detect we are in the backlog processing and defer all eligible ACK into a single one, sent from tcp_release_cb().
This saves cpu cycles on both sides, and network resources.
Performance of a single TCP flow on a 200Gbit NIC:
- Throughput is increased by 20% (100Gbit -> 120Gbit).
- Number of generated ACKs per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.
Benchmark context:
- Regular netperf TCP_STREAM (no zerocopy)
- Intel(R) Xeon(R) Platinum 8481C (Sapphire Rapids)
- MAX_SKB_FRAGS = 17 (~60KB per GRO packet)
This feature is guarded by a new sysctl, and enabled by default: /proc/sys/net/ipv4/tcp_backlog_ack_defer
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Yuchung Cheng <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Dave Taht <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
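
The defer-then-flush idea reduces to a few lines; this user-space sketch (invented names, no real socket code) shows why one cumulative ACK can replace many:

    #include <stdbool.h>
    #include <stdio.h>

    struct sk_sketch {
        bool processing_backlog;
        bool ack_deferred;
        int  acks_sent;
    };

    static void maybe_send_ack(struct sk_sketch *sk)
    {
        if (sk->processing_backlog) {
            sk->ack_deferred = true;    /* coalesce into one later ACK */
            return;
        }
        sk->acks_sent++;
    }

    /* analogous to flushing the deferred ACK from tcp_release_cb() */
    static void release_sock_sketch(struct sk_sketch *sk)
    {
        sk->processing_backlog = false;
        if (sk->ack_deferred) {
            sk->ack_deferred = false;
            sk->acks_sent++;            /* single cumulative ACK */
        }
    }

    int main(void)
    {
        struct sk_sketch sk = { .processing_backlog = true };

        for (int i = 0; i < 100; i++)   /* 100 packets sitting in backlog */
            maybe_send_ack(&sk);
        release_sock_sketch(&sk);
        printf("acks_sent=%d\n", sk.acks_sent); /* 1 instead of 100 */
        return 0;
    }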

Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5

# d58f2e15 | 04-Aug-2023 | Eric Dumazet <[email protected]>

tcp: set TCP_USER_TIMEOUT locklessly
icsk->icsk_user_timeout can be set locklessly, if all read sides use READ_ONCE().
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
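
The pattern is a plain store on the write side paired with READ_ONCE() on every reader; in a user-space sketch the closest analog is C11 relaxed atomics (names illustrative):

    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic unsigned int icsk_user_timeout;

    /* setsockopt(TCP_USER_TIMEOUT) path: no socket lock needed */
    static void tcp_set_user_timeout_sketch(unsigned int val)
    {
        atomic_store_explicit(&icsk_user_timeout, val, memory_order_relaxed);
    }

    /* timer/retransmit paths: every reader uses the READ_ONCE() analog */
    static unsigned int tcp_user_timeout_sketch(void)
    {
        return atomic_load_explicit(&icsk_user_timeout, memory_order_relaxed);
    }

    int main(void)
    {
        tcp_set_user_timeout_sketch(30000);
        printf("%u ms\n", tcp_user_timeout_sketch());
        return 0;
    }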

Revision tags: v6.5-rc4, v6.5-rc3

# 70f360dd | 19-Jul-2023 | Eric Dumazet <[email protected]>

tcp: annotate data-races around fastopenq.max_qlen
This field can be read locklessly.
Fixes: 1536e2857bd3 ("tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner")
Signed-off-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

# dfa2f048 | 17-Jul-2023 | Eric Dumazet <[email protected]>

tcp: get rid of sysctl_tcp_adv_win_scale
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path.
The tcp_win_from_space() implementation is limited to a few cases.
For hosts dealing with various MSS, we either underestimate or overestimate the RWIN we send to the remote peers.
For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory.
For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes.
This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio.
This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
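
A sketch of the arithmetic (the fixed-point scaling and the specific numbers are illustrative, not the exact kernel formula): derive a per-socket ratio from an observed skb, then size the advertised window from it:

    #include <stdio.h>

    int main(void)
    {
        unsigned int len = 1448;        /* TCP payload in one received skb */
        unsigned int truesize = 4096;   /* order-0 page backing that skb */

        /* payload fraction of memory use, as an 8-bit fixed-point ratio */
        unsigned int scaling_ratio = (len << 8) / truesize;     /* 90/256 */

        unsigned int space = 1 << 20;   /* 1 MB of receive buffer */
        unsigned int rwin = (unsigned int)(((unsigned long long)space *
                                            scaling_ratio) >> 8);

        printf("scaling_ratio=%u/256, advertised window ~%u bytes\n",
               scaling_ratio, rwin);
        return 0;
    }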

Revision tags: v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3

# e9d9da91 | 17-Mar-2023 | Eric Dumazet <[email protected]>

tcp: preserve const qualifier in tcp_sk()
We can change tcp_sk() to propagate its argument const qualifier, thanks to container_of_const().
We have two places where a const sock pointer has to be upgraded to a write one. We have been using const qualifier for lockless listeners to clearly identify points where writes could happen.
Add tcp_sk_rw() helper to better document these.
tcp_inbound_md5_hash(), __tcp_grow_window(), tcp_reset_check() and tcp_rack_reo_wnd() get an additional const qualifier for their @tp local variables.
smc_check_reset_syn_req() also needs a similar change.
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
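
The const-propagation trick can be sketched in user space with C11 _Generic, the same mechanism container_of_const() builds on (struct layout and names simplified):

    #include <stdio.h>

    struct sock { int num; };
    struct tcp_sock { struct sock sk; int snd_cwnd; }; /* sk is first member */

    /* returns const struct tcp_sock * for const inputs, non-const otherwise */
    #define tcp_sk_sketch(ptr) _Generic((ptr),                      \
        const struct sock *: (const struct tcp_sock *)(ptr),        \
        struct sock *: (struct tcp_sock *)(ptr))

    int main(void)
    {
        struct tcp_sock t = { .snd_cwnd = 10 };
        const struct sock *csk = &t.sk;

        /* reading is fine; writing through tcp_sk_sketch(csk) would not
         * compile, which is exactly the point */
        printf("%d\n", tcp_sk_sketch(csk)->snd_cwnd);
        return 0;
    }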

Revision tags: v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3

# 29c1c446 | 26-Oct-2022 | Mubashir Adnan Qureshi <[email protected]>

tcp: add u32 counter in tcp_sock and an SNMP counter for PLB
A u32 counter is added to tcp_sock for counting the number of PLB triggered rehashes for a TCP connection. An SNMP counter is also added to count overall PLB triggered rehash events for a host. These counters are hooked up to PLB implementation for DCTCP.
TCP_NLA_REHASH is added to SCM_TIMESTAMPING_OPT_STATS that reports the rehash attempts triggered due to PLB or timeouts. This gives a historical view of sustained congestion or timeouts experienced by the TCP connection.
Signed-off-by: Mubashir Adnan Qureshi <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.1-rc2, v6.1-rc1, v6.0

# f4ce91ce | 28-Sep-2022 | Neal Cardwell <[email protected]>

tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited
This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be.
The following event sequence is an example that triggers the bug:
(a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number.
(b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false.
This commit fixes that bug.
One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number.
Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized:
(1) If the connection is cwnd-limited then we remember that fact for the current window.
(2) If the connection is not cwnd-limited, then we track the maximum number of outstanding packets in the current window.
In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false.
This first showed up in testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC.
Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Kevin(Yudong) Yang <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
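
A reduced model of the rule described above (the real tcp_cwnd_validate() also consults a per-window sequence check; this sketch keeps only the flag and peak-flight logic):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct tp_sketch {
        uint32_t packets_out;
        uint32_t max_packets_out;
        bool     is_cwnd_limited;
    };

    /* a non-cwnd-limited sample may raise max_packets_out, but it can no
     * longer clear a true is_cwnd_limited within the same window */
    static void cwnd_validate_update(struct tp_sketch *tp, bool is_cwnd_limited)
    {
        if (is_cwnd_limited ||
            (!tp->is_cwnd_limited &&
             tp->packets_out > tp->max_packets_out)) {
            tp->is_cwnd_limited = is_cwnd_limited;
            tp->max_packets_out = tp->packets_out;
        }
    }

    int main(void)
    {
        struct tp_sketch tp = {0};

        tp.packets_out = 3;              /* (a) cwnd-limited, small flight */
        cwnd_validate_update(&tp, true);
        tp.packets_out = 10;             /* (b) pacing-limited, larger flight */
        cwnd_validate_update(&tp, false);
        printf("is_cwnd_limited=%d\n", tp.is_cwnd_limited); /* stays 1 */
        return 0;
    }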

# 061ff040 | 29-Sep-2022 | Martin KaFai Lau <[email protected]>

bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
When a bad bpf prog '.init' calls bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
    .init => bpf_setsockopt(tcp_cc)
          => .init => bpf_setsockopt(tcp_cc)
          => ...
          => .init => bpf_setsockopt(tcp_cc).
It was prevented by the prog->active counter before, but the prog->active detection cannot be used in struct_ops, as explained in the earlier patch of the set.
In this patch, the second bpf_setsockopt(tcp_cc) is not allowed in order to break the loop. This is done by using a bit of an existing 1 byte hole in tcp_sock to check if there is on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
Note that this essentially means only the first '.init' can call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (e.g. the peer does not support ECN), and the second '.init' cannot fall back to another cc. This applies even if the second bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
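
A user-space sketch of the guard bit breaking the recursion (the bitfield name matches the commit; everything else is illustrative):

    #include <stdio.h>

    struct tcp_sock_sketch {
        unsigned int bpf_chg_cc_inprogress:1;   /* fits an existing hole */
    };

    static int setsockopt_congestion(struct tcp_sock_sketch *tp, int depth);

    /* a misbehaving .init that immediately calls bpf_setsockopt() again */
    static void cc_init(struct tcp_sock_sketch *tp, int depth)
    {
        setsockopt_congestion(tp, depth + 1);
    }

    static int setsockopt_congestion(struct tcp_sock_sketch *tp, int depth)
    {
        if (tp->bpf_chg_cc_inprogress)
            return -1;              /* second call rejected: loop broken */
        tp->bpf_chg_cc_inprogress = 1;
        printf("init at depth %d\n", depth);    /* printed exactly once */
        cc_init(tp, depth);         /* without the bit: infinite recursion */
        tp->bpf_chg_cc_inprogress = 0;
        return 0;
    }

    int main(void)
    {
        struct tcp_sock_sketch tp = {0};

        return setsockopt_congestion(&tp, 0);
    }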