|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
f278b6d5 |
| 31-Mar-2025 |
Eric Dumazet <[email protected]> |
Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc"
This reverts commit 0de2a5c4b824da2205658ebebb99a55c43cdf60f.
I forgot that a TCP socket could receive messages in its error queue.
sock_
Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc"
This reverts commit 0de2a5c4b824da2205658ebebb99a55c43cdf60f.
I forgot that a TCP socket could receive messages in its error queue.
sock_queue_err_skb() can be called without socket lock being held, and changes sk->sk_rmem_alloc.
The fact that skbs in error queue are limited by sk->sk_rcvbuf means that error messages can be dropped if socket receive queues are full, which is an orthogonal issue.
In future kernels, we could use a separate sk->sk_error_mem_alloc counter specifically for the error queue.
Fixes: 0de2a5c4b824 ("tcp: avoid atomic operations on sk->sk_rmem_alloc") Signed-off-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14 |
|
| #
0de2a5c4 |
| 20-Mar-2025 |
Eric Dumazet <[email protected]> |
tcp: avoid atomic operations on sk->sk_rmem_alloc
TCP uses generic skb_set_owner_r() and sock_rfree() for received packets, with socket lock being owned.
Switch to private versions, avoiding two at
tcp: avoid atomic operations on sk->sk_rmem_alloc
TCP uses generic skb_set_owner_r() and sock_rfree() for received packets, with socket lock being owned.
Switch to private versions, avoiding two atomic operations per packet.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
1937a0be |
| 17-Mar-2025 |
Eric Dumazet <[email protected]> |
tcp: move icsk_clean_acked to a better location
As a followup of my presentation in Zagreb for netdev 0x19:
icsk_clean_acked is only used by TCP when/if CONFIG_TLS_DEVICE is enabled from tcp_ack().
tcp: move icsk_clean_acked to a better location
As a followup of my presentation in Zagreb for netdev 0x19:
icsk_clean_acked is only used by TCP when/if CONFIG_TLS_DEVICE is enabled from tcp_ack().
Rename it to tcp_clean_acked, move it to tcp_sock structure in the tcp_sock_read_rx for better cache locality in TCP fast path.
Define this field only when CONFIG_TLS_DEVICE is enabled saving 8 bytes on configs not using it.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Sabrina Dubroca <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc7 |
|
| #
15492700 |
| 12-Mar-2025 |
Eric Dumazet <[email protected]> |
tcp: cache RTAX_QUICKACK metric in a hot cache line
tcp_in_quickack_mode() is called from input path for small packets.
It calls __sk_dst_get() which reads sk->sk_dst_cache which has been put in so
tcp: cache RTAX_QUICKACK metric in a hot cache line
tcp_in_quickack_mode() is called from input path for small packets.
It calls __sk_dst_get() which reads sk->sk_dst_cache which has been put in sock_read_tx group (for good reasons).
Then dst_metric(dst, RTAX_QUICKACK) also needs extra cache line misses.
Cache RTAX_QUICKACK in icsk->icsk_ack.dst_quick_ack to no longer pull these cache lines for the cases a delayed ACK is scheduled.
After this patch TCP receive path does not longer access sock_read_tx group.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc6 |
|
| #
041fb11d |
| 05-Mar-2025 |
Ilpo Järvinen <[email protected]> |
tcp: helpers for ECN mode handling
Create helpers for TCP ECN modes. No functional changes.
Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.
tcp: helpers for ECN mode handling
Create helpers for TCP ECN modes. No functional changes.
Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: Chia-Yu Chang <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
f0db2bca |
| 05-Mar-2025 |
Ilpo Järvinen <[email protected]> |
tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is called only for data segments, not for ACKs (with AccECN, also ACKs may get ECN bit
tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is called only for data segments, not for ACKs (with AccECN, also ACKs may get ECN bits).
The extra "layer" in tcp_ecn_check_ce() function just checks for ECN being enabled, that can be moved into tcp_ecn_field_check rather than having the __ variant.
No functional changes.
Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: Chia-Yu Chang <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
da610e18 |
| 05-Mar-2025 |
Ilpo Järvinen <[email protected]> |
tcp: create FLAG_TS_PROGRESS
Whenever timestamp advances, it declares progress which can be used by the other parts of the stack to decide that the ACK is the most recent one seen so far.
AccECN wi
tcp: create FLAG_TS_PROGRESS
Whenever timestamp advances, it declares progress which can be used by the other parts of the stack to decide that the ACK is the most recent one seen so far.
AccECN will use this flag when deciding whether to use the ACK to update AccECN state or not.
Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: Chia-Yu Chang <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
149dfb31 |
| 05-Mar-2025 |
Ilpo Järvinen <[email protected]> |
tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce() out of it - Move tcp_in_ack_event() later - While at it, remove
tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce() out of it - Move tcp_in_ack_event() later - While at it, remove the inline from tcp_in_ack_event() and let the compiler to decide
Accurate ECN's heuristics does not know if there is going to be ACE field based CE counter increase or not until after rtx queue has been processed. Only then the number of ACKed bytes/pkts is available. As CE or not affects presence of FLAG_ECE, that information for tcp_in_ack_event is not yet available in the old location of the call to tcp_in_ack_event().
Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: Chia-Yu Chang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc5 |
|
| #
e34100c2 |
| 01-Mar-2025 |
Eric Dumazet <[email protected]> |
tcp: add a drop_reason pointer to tcp_check_req()
We want to add new drop reasons for packets dropped in 3WHS in the following patches.
tcp_rcv_state_process() has to set reason to TCP_FASTOPEN, be
tcp: add a drop_reason pointer to tcp_check_req()
We want to add new drop reasons for packets dropped in 3WHS in the following patches.
tcp_rcv_state_process() has to set reason to TCP_FASTOPEN, because tcp_check_req() will conditionally overwrite the drop_reason.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
3ba07527 |
| 25-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: be less liberal in TSEcr received while in SYN_RECV state
Yong-Hao Zou mentioned that linux was not strict as other OS in 3WHS, for flows using TCP TS option (RFC 7323)
As hinted by an old com
tcp: be less liberal in TSEcr received while in SYN_RECV state
Yong-Hao Zou mentioned that linux was not strict as other OS in 3WHS, for flows using TCP TS option (RFC 7323)
As hinted by an old comment in tcp_check_req(), we can check the TSEcr value in the incoming packet corresponds to one of the SYNACK TSval values we have sent.
In this patch, I record the oldest and most recent values that SYNACK packets have used.
Send a challenge ACK if we receive a TSEcr outside of this range, and increase a new SNMP counter.
nstat -az | grep TSEcrRejected TcpExtTSEcrRejected 0 0.0
Due to TCP fastopen implementation, do not apply yet these checks for fastopen flows.
v2: No longer use req->num_timeout, but treq->snt_tsval_first to detect when first SYNACK is prepared. This means we make sure to not send an initial zero TSval. Make sure MPTCP and TCP selftests are passing. Change MIB name to TcpExtTSEcrRejected
v1: https://lore.kernel.org/netdev/CADVnQykD8i4ArpSZaPKaoNxLJ2if2ts9m4As+=Jvdkrgx1qMHw@mail.gmail.com/T/
Reported-by: Yong-Hao Zou <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc4 |
|
| #
fd93eaff |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
The subsequent patch will implement BPF TX timestamping. It will call the sockops BPF program without holding the sock
bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
The subsequent patch will implement BPF TX timestamping. It will call the sockops BPF program without holding the sock lock.
This breaks the current assumption that all sock ops programs will hold the sock lock. The sock's fields of the uapi's bpf_sock_ops requires this assumption.
To address this, a new "u8 is_locked_tcp_sock;" field is added. This patch sets it in the current sock_ops callbacks. The "is_fullsock" test is then replaced by the "is_locked_tcp_sock" test during sock_ops_convert_ctx_access().
The new TX timestamping callbacks added in the subsequent patch will not have this set. This will prevent unsafe access from the new timestamping callbacks.
Potentially, we could allow read-only access. However, this would require identifying which callback is read-safe-only and also requires additional BPF instruction rewrites in the covert_ctx. Since the BPF program can always read everything from a socket (e.g., by using bpf_core_cast), this patch keeps it simple and disables all read and write access to any socket fields through the bpf_sock_ops UAPI from the new TX timestamping callback.
Moreover, note that some of the fields in bpf_sock_ops are specific to tcp_sock, and sock_ops currently only supports tcp_sock. In the future, UDP timestamping will be added, which will also break this assumption. The same idea used in this patch will be reused. Considering that the current sock_ops only supports tcp_sock, the variable is named is_locked_"tcp"_sock.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
9b6412e6 |
| 17-Feb-2025 |
Sabrina Dubroca <[email protected]> |
tcp: drop secpath at the same time as we currently drop dst
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while running tests that boil down to: - create a pair of netns - run a basic
tcp: drop secpath at the same time as we currently drop dst
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while running tests that boil down to: - create a pair of netns - run a basic TCP test over ipcomp6 - delete the pair of netns
The xfrm_state found on spi_byaddr was not deleted at the time we delete the netns, because we still have a reference on it. This lingering reference comes from a secpath (which holds a ref on the xfrm_state), which is still attached to an skb. This skb is not leaked, it ends up on sk_receive_queue and then gets defer-free'd by skb_attempt_defer_free.
The problem happens when we defer freeing an skb (push it on one CPU's defer_list), and don't flush that list before the netns is deleted. In that case, we still have a reference on the xfrm_state that we don't expect at this point.
We already drop the skb's dst in the TCP receive path when it's no longer needed, so let's also drop the secpath. At this point, tcp_filter has already called into the LSM hooks that may require the secpath, so it should not be needed anymore. However, in some of those places, the MPTCP extension has just been attached to the skb, so we cannot simply drop all extensions.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Reported-by: Xiumei Mu <[email protected]> Signed-off-by: Sabrina Dubroca <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
f5da7c45 |
| 17-Feb-2025 |
Jakub Kicinski <[email protected]> |
tcp: adjust rcvq_space after updating scaling ratio
Since commit under Fixes we set the window clamp in accordance to newly measured rcvbuf scaling_ratio. If the scaling_ratio decreased significantl
tcp: adjust rcvq_space after updating scaling ratio
Since commit under Fixes we set the window clamp in accordance to newly measured rcvbuf scaling_ratio. If the scaling_ratio decreased significantly we may put ourselves in a situation where windows become smaller than rcvq_space, preventing tcp_rcv_space_adjust() from increasing rcvbuf.
The significant decrease of scaling_ratio is far more likely since commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"), which increased the "default" scaling ratio from ~30% to 50%.
Hitting the bad condition depends a lot on TCP tuning, and drivers at play. One of Meta's workloads hits it reliably under following conditions: - default rcvbuf of 125k - sender MTU 1500, receiver MTU 5000 - driver settles on scaling_ratio of 78 for the config above. Initial rcvq_space gets calculated as TCP_INIT_CWND * tp->advmss (10 * 5k = 50k). Once we find out the true scaling ratio and MSS we clamp the windows to 38k. Triggering the condition also depends on the message sequence of this workload. I can't repro the problem with simple iperf or TCP_RR-style tests.
Fixes: a2cbb1603943 ("tcp: Update window clamping condition") Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc3 |
|
| #
8e677a46 |
| 14-Feb-2025 |
Breno Leitao <[email protected]> |
trace: tcp: Add tracepoint for tcp_cwnd_reduction()
Add a lightweight tracepoint to monitor TCP congestion window adjustments via tcp_cwnd_reduction(). This tracepoint enables tracking of: - TCP win
trace: tcp: Add tracepoint for tcp_cwnd_reduction()
Add a lightweight tracepoint to monitor TCP congestion window adjustments via tcp_cwnd_reduction(). This tracepoint enables tracking of: - TCP window size fluctuations - Active socket behavior - Congestion window reduction events
Meta has been using BPF programs to monitor this function for years. Adding a proper tracepoint provides a stable API for all users who need to monitor TCP congestion window behavior.
Use DECLARE_TRACE instead of TRACE_EVENT to avoid creating trace event infrastructure and exporting to tracefs, keeping the implementation minimal. (Thanks Steven Rostedt)
Given that this patch creates a rawtracepoint, you could hook into it using regular tooling, like bpftrace, using regular rawtracepoint infrastructure, such as:
rawtracepoint:tcp_cwnd_reduction_tp { .... }
Signed-off-by: Breno Leitao <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
6dc4c252 |
| 12-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: use EXPORT_IPV6_MOD[_GPL]()
Use EXPORT_IPV6_MOD[_GPL]() for symbols that don't need to be exported unless CONFIG_IPV6=m
tcp_hashinfo and tcp_openreq_init_rwin() are no longer used from any mod
tcp: use EXPORT_IPV6_MOD[_GPL]()
Use EXPORT_IPV6_MOD[_GPL]() for symbols that don't need to be exported unless CONFIG_IPV6=m
tcp_hashinfo and tcp_openreq_init_rwin() are no longer used from any module anyway.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Mateusz Polchlopek <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2 |
|
| #
54a378f4 |
| 07-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: add the ability to control max RTO
Currently, TCP stack uses a constant (120 seconds) to limit the RTO value exponential growth.
Some applications want to set a lower value.
Add TCP_RTO_MAX_M
tcp: add the ability to control max RTO
Currently, TCP stack uses a constant (120 seconds) to limit the RTO value exponential growth.
Some applications want to set a lower value.
Add TCP_RTO_MAX_MS socket option to set a value (in ms) between 1 and 120 seconds.
It is discouraged to change the socket rto max on a live socket, as it might lead to unexpected disconnects.
Following patch is adding a netns sysctl to control the default value at socket creation time.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
48b69b4c |
| 07-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: use tcp_reset_xmit_timer()
In order to reduce TCP_RTO_MAX occurrences, replace:
inet_csk_reset_xmit_timer(sk, what, when, TCP_RTO_MAX)
With:
tcp_reset_xmit_timer(sk, what, when, fals
tcp: use tcp_reset_xmit_timer()
In order to reduce TCP_RTO_MAX occurrences, replace:
inet_csk_reset_xmit_timer(sk, what, when, TCP_RTO_MAX)
With:
tcp_reset_xmit_timer(sk, what, when, false);
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
7baa0301 |
| 07-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: add a @pace_delay parameter to tcp_reset_xmit_timer()
We want to factorize calls to inet_csk_reset_xmit_timer(), to ease TCP_RTO_MAX change.
Current users want to add tcp_pacing_delay(sk) to t
tcp: add a @pace_delay parameter to tcp_reset_xmit_timer()
We want to factorize calls to inet_csk_reset_xmit_timer(), to ease TCP_RTO_MAX change.
Current users want to add tcp_pacing_delay(sk) to the timeout.
Remaining calls to inet_csk_reset_xmit_timer() do not add the pacing delay. Following patch will convert them, passing false for @pace_delay.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
0fed4637 |
| 07-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: remove tcp_reset_xmit_timer() @max_when argument
All callers use TCP_RTO_MAX, we can factorize this constant, becoming a variable soon.
Signed-off-by: Eric Dumazet <[email protected]> Review
tcp: remove tcp_reset_xmit_timer() @max_when argument
All callers use TCP_RTO_MAX, we can factorize this constant, becoming a variable soon.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
be258f65 |
| 06-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: rename inet_csk_{delete|reset}_keepalive_timer()
inet_csk_delete_keepalive_timer() and inet_csk_reset_keepalive_timer() are only used from core TCP, there is no need to export them.
Replace th
tcp: rename inet_csk_{delete|reset}_keepalive_timer()
inet_csk_delete_keepalive_timer() and inet_csk_reset_keepalive_timer() are only used from core TCP, there is no need to export them.
Replace their prefix by tcp.
Move them to net/ipv4/tcp_timer.c and make tcp_delete_keepalive_timer() static.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Joe Damato <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
d876ec8d |
| 06-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: do not export tcp_parse_mss_option() and tcp_mtup_init()
These two functions are not called from modules.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Simon Horman <horms@ker
tcp: do not export tcp_parse_mss_option() and tcp_mtup_init()
These two functions are not called from modules.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Joe Damato <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13 |
|
| #
d16b3447 |
| 13-Jan-2025 |
Eric Dumazet <[email protected]> |
tcp: add LINUX_MIB_PAWS_OLD_ACK SNMP counter
Prior patch in the series added TCP_RFC7323_PAWS_ACK drop reason.
This patch adds the corresponding SNMP counter, for folks using nstat instead of traci
tcp: add LINUX_MIB_PAWS_OLD_ACK SNMP counter
Prior patch in the series added TCP_RFC7323_PAWS_ACK drop reason.
This patch adds the corresponding SNMP counter, for folks using nstat instead of tracing for TCP diagnostics.
nstat -az | grep PAWSOldAck
Suggested-by: Neal Cardwell <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Jason Xing <[email protected]> Tested-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
124c4c32 |
| 13-Jan-2025 |
Eric Dumazet <[email protected]> |
tcp: add TCP_RFC7323_PAWS_ACK drop reason
XPS can cause reorders because of the relaxed OOO conditions for pure ACK packets.
For hosts not using RFS, what can happpen is that ACK packets are sent o
tcp: add TCP_RFC7323_PAWS_ACK drop reason
XPS can cause reorders because of the relaxed OOO conditions for pure ACK packets.
For hosts not using RFS, what can happpen is that ACK packets are sent on behalf of the cpu processing NIC interrupts, selecting TX queue A for ACK packet P1.
Then a subsequent sendmsg() can run on another cpu. TX queue selection uses the socket hash and can choose another queue B for packets P2 (with payload).
If queue A is more congested than queue B, the ACK packet P1 could be sent on the wire after P2.
A linux receiver when processing P1 (after P2) currently increments LINUX_MIB_PAWSESTABREJECTED (TcpExtPAWSEstab) and use TCP_RFC7323_PAWS drop reason. It might also send a DUPACK if not rate limited.
In order to better understand this pattern, this patch adds a new drop_reason : TCP_RFC7323_PAWS_ACK.
For old ACKS like these, we no longer increment LINUX_MIB_PAWSESTABREJECTED and no longer sends a DUPACK, keeping credit for other more interesting DUPACK.
perf record -e skb:kfree_skb -a perf script ... swapper 0 [148] 27475.438637: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [208] 27475.438706: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [208] 27475.438908: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [148] 27475.439010: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [148] 27475.439214: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK swapper 0 [208] 27475.439286: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK ...
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
ea98b61b |
| 13-Jan-2025 |
Eric Dumazet <[email protected]> |
tcp: add drop_reason support to tcp_disordered_ack()
Following patch is adding a new drop_reason to tcp_validate_incoming().
Change tcp_disordered_ack() to not return a boolean anymore, but a drop
tcp: add drop_reason support to tcp_disordered_ack()
Following patch is adding a new drop_reason to tcp_validate_incoming().
Change tcp_disordered_ack() to not return a boolean anymore, but a drop reason.
Change its name to tcp_disordered_ack_check()
Refactor tcp_validate_incoming() to ease the code review of the following patch, and reduce indentation level.
This patch is a refactor, with no functional change.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
| #
4f4aa4aa |
| 19-Dec-2024 |
Wang Liang <[email protected]> |
net: fix memory leak in tcp_conn_request()
If inet_csk_reqsk_queue_hash_add() return false, tcp_conn_request() will return without free the dst memory, which allocated in af_ops->route_req.
Here is
net: fix memory leak in tcp_conn_request()
If inet_csk_reqsk_queue_hash_add() return false, tcp_conn_request() will return without free the dst memory, which allocated in af_ops->route_req.
Here is the kmemleak stack:
unreferenced object 0xffff8881198631c0 (size 240): comm "softirq", pid 0, jiffies 4299266571 (age 1802.392s) hex dump (first 32 bytes): 00 10 9b 03 81 88 ff ff 80 98 da bc ff ff ff ff ................ 81 55 18 bb ff ff ff ff 00 00 00 00 00 00 00 00 .U.............. backtrace: [<ffffffffb93e8d4c>] kmem_cache_alloc+0x60c/0xa80 [<ffffffffba11b4c5>] dst_alloc+0x55/0x250 [<ffffffffba227bf6>] rt_dst_alloc+0x46/0x1d0 [<ffffffffba23050a>] __mkroute_output+0x29a/0xa50 [<ffffffffba23456b>] ip_route_output_key_hash+0x10b/0x240 [<ffffffffba2346bd>] ip_route_output_flow+0x1d/0x90 [<ffffffffba254855>] inet_csk_route_req+0x2c5/0x500 [<ffffffffba26b331>] tcp_conn_request+0x691/0x12c0 [<ffffffffba27bd08>] tcp_rcv_state_process+0x3c8/0x11b0 [<ffffffffba2965c6>] tcp_v4_do_rcv+0x156/0x3b0 [<ffffffffba299c98>] tcp_v4_rcv+0x1cf8/0x1d80 [<ffffffffba239656>] ip_protocol_deliver_rcu+0xf6/0x360 [<ffffffffba2399a6>] ip_local_deliver_finish+0xe6/0x1e0 [<ffffffffba239b8e>] ip_local_deliver+0xee/0x360 [<ffffffffba239ead>] ip_rcv+0xad/0x2f0 [<ffffffffba110943>] __netif_receive_skb_one_core+0x123/0x140
Call dst_release() to free the dst memory when inet_csk_reqsk_queue_hash_add() return false in tcp_conn_request().
Fixes: ff46e3b44219 ("Fix race for duplicate reqsk on identical SYN") Signed-off-by: Wang Liang <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|