|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
f38805c5 |
| 17-Mar-2025 |
Jason Xing <[email protected]> |
tcp: support TCP_RTO_MIN_US for set/getsockopt use
Support adjusting/reading RTO MIN for socket level by using set/getsockopt().
This new option has the same effect as TCP_BPF_RTO_MIN, which means
tcp: support TCP_RTO_MIN_US for set/getsockopt use
Support adjusting/reading RTO MIN for socket level by using set/getsockopt().
This new option has the same effect as TCP_BPF_RTO_MIN, which means it doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use.
When the socket is created, its icsk_rto_min is set to the default value that is controlled by sysctl_tcp_rto_min_us. Then if application calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then icsk_rto_min will be overridden in jiffies unit.
This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around icsk_rto_min.
Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2 |
|
| #
1280c262 |
| 07-Feb-2025 |
Eric Dumazet <[email protected]> |
tcp: add tcp_rto_max_ms sysctl
Previous patch added a TCP_RTO_MAX_MS socket option to tune a TCP socket max RTO value.
Many setups prefer to change a per netns sysctl.
This patch adds /proc/sys/ne
tcp: add tcp_rto_max_ms sysctl
Previous patch added a TCP_RTO_MAX_MS socket option to tune a TCP socket max RTO value.
Many setups prefer to change a per netns sysctl.
This patch adds /proc/sys/net/ipv4/tcp_rto_max_ms
Its initial value is 120000 (120 seconds).
Keep in mind that a decrease of tcp_rto_max_ms means shorter overall timeouts, unless tcp_retries2 sysctl is increased.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
ca6a6f93 |
| 09-Dec-2024 |
Jakub Sitnicki <[email protected]> |
tcp: Add sysctl to configure TIME-WAIT reuse delay
Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be reused by reopening a connection. This is a safe choice based on an assum
tcp: Add sysctl to configure TIME-WAIT reuse delay
Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be reused by reopening a connection. This is a safe choice based on an assumption that the other TCP timestamp clock frequency, which is unknown to us, may be as low as 1 Hz (RFC 7323, section 5.4).
However, this means that in the presence of short lived connections with an RTT of couple of milliseconds, the time during which a 4-tuple is blocked from reuse can be orders of magnitude longer that the connection lifetime. Combined with a reduced pool of ephemeral ports, when using IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the long TIME-WAIT reuse delay can lead to port exhaustion, where all available 4-tuples are tied up in TIME-WAIT state.
Turn the reuse delay into a per-netns setting so that sysadmins can make more aggressive assumptions about remote TCP timestamp clock frequency and shorten the delay in order to allow connections to reincarnate faster.
Note that applications can completely bypass the TIME-WAIT delay protection already today by locking the local port with bind() before connecting. Such immediate connection reuse may result in PAWS failing to detect old duplicate segments, leaving us with just the sequence number check as a safety net.
This new configurable offers a trade off where the sysadmin can balance between the risk of PAWS detection failing to act versus exhausting ports by having sockets tied up in TIME-WAIT state for too long.
[1] https://lpc.events/event/16/contributions/1349/
Signed-off-by: Jakub Sitnicki <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-2-66aca0eed03e@cloudflare.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
bbe4b412 |
| 09-Dec-2024 |
Petr Machata <[email protected]> |
Documentation: networking: Add a caveat to nexthop_compat_mode sysctl
net.ipv4.nexthop_compat_mode was added when nexthop objects were added to provide the view of nexthop objects through the usual
Documentation: networking: Add a caveat to nexthop_compat_mode sysctl
net.ipv4.nexthop_compat_mode was added when nexthop objects were added to provide the view of nexthop objects through the usual lens of the route UAPI. As nexthop objects evolved, the information provided through this lens became incomplete. For example, details of resilient nexthop groups are obviously omitted.
Now that 16-bit nexthop group weights are a thing, the 8-bit UAPI cannot convey the >8-bit weight accurately. Instead of inventing workarounds for an obsolete interface, just document the expectations of inaccuracy.
Fixes: b72a6a7ab957 ("net: nexthop: Increase weight to u16") Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/b575e32399ccacd09079b2a218255164535123bd.1733740749.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
990c3049 |
| 29-Jul-2024 |
Patrick Rohr <[email protected]> |
Add support for PIO p flag
draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information Option to signal that the network can allocate a unique IPv6 prefix per client via DHCPv6-PD (see
Add support for PIO p flag
draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information Option to signal that the network can allocate a unique IPv6 prefix per client via DHCPv6-PD (see draft-ietf-v6ops-dhcp-pd-per-device).
When ra_honor_pio_pflag is enabled, the presence of a P-flag causes SLAAC autoconfiguration to be disabled for that particular PIO.
An automated test has been added in Android (r.android.com/3195335) to go along with this change.
Cc: Maciej Żenczykowski <[email protected]> Cc: Lorenzo Colitti <[email protected]> Cc: David Lamparter <[email protected]> Cc: Simon Horman <[email protected]> Signed-off-by: Patrick Rohr <[email protected]> Reviewed-by: Maciej Żenczykowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3 |
|
| #
4ee2a8ca |
| 07-Jun-2024 |
Petr Machata <[email protected]> |
net: ipv4: Add a sysctl to set multipath hash seed
When calculating hashes for the purpose of multipath forwarding, both IPv4 and IPv6 code currently fall back on flow_hash_from_keys(). That uses a
net: ipv4: Add a sysctl to set multipath hash seed
When calculating hashes for the purpose of multipath forwarding, both IPv4 and IPv6 code currently fall back on flow_hash_from_keys(). That uses a randomly-generated seed. That's a fine choice by default, but unfortunately some deployments may need a tighter control over the seed used.
In this patch, make the seed configurable by adding a new sysctl key, net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used specifically for multipath forwarding and not for the other concerns that flow_hash_from_keys() is used for, such as queue selection. Expose the knob as sysctl because other such settings, such as headers to hash, are also handled that way. Like those, the multipath hash seed is a per-netns variable.
Despite being placed in the net.ipv4 namespace, the multipath seed sysctl is used for both IPv4 and IPv6, similarly to e.g. a number of TCP variables.
The seed used by flow_hash_from_keys() is a 128-bit quantity. However it seems that usually the seed is a much more modest value. 32 bits seem typical (Cisco, Cumulus), some systems go even lower. For that reason, and to decouple the user interface from implementation details, go with a 32-bit quantity, which is then quadruplicated to form the siphash key.
Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
f086edef |
| 03-Jun-2024 |
Kevin Yang <[email protected]> |
tcp: add sysctl_tcp_rto_min_us
Adding a sysctl knob to allow user to specify a default rto_min at socket init time, other than using the hard coded 200ms default rto_min.
Note that the rto_min rout
tcp: add sysctl_tcp_rto_min_us
Adding a sysctl knob to allow user to specify a default rto_min at socket init time, other than using the hard coded 200ms default rto_min.
Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN socket option, followed by the tcp_rto_min_us sysctl.
Signed-off-by: Kevin Yang <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Yuchung Cheng <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Tony Lu <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5 |
|
| #
f4bcbf36 |
| 14-Feb-2024 |
Alex Henrie <[email protected]> |
net: ipv6/addrconf: clamp preferred_lft to the minimum required
If the preferred lifetime was less than the minimum required lifetime, ipv6_create_tempaddr would error out without creating any new a
net: ipv6/addrconf: clamp preferred_lft to the minimum required
If the preferred lifetime was less than the minimum required lifetime, ipv6_create_tempaddr would error out without creating any new address. On my machine and network, this error happened immediately with the preferred lifetime set to 5 seconds or less, after a few minutes with the preferred lifetime set to 6 seconds, and not at all with the preferred lifetime set to 7 seconds. During my investigation, I found a Stack Exchange post from another person who seems to have had the same problem: They stopped getting new addresses if they lowered the preferred lifetime below 3 seconds, and they didn't really know why.
The preferred lifetime is a preference, not a hard requirement. The kernel does not strictly forbid new connections on a deprecated address, nor does it guarantee that the address will be disposed of the instant its total valid lifetime expires. So rather than disable IPv6 privacy extensions altogether if the minimum required lifetime swells above the preferred lifetime, it is more in keeping with the user's intent to increase the temporary address's lifetime to the minimum necessary for the current network conditions.
With these fixes, setting the preferred lifetime to 5 or 6 seconds "just works" because the extra fraction of a second is practically unnoticeable. It's even possible to reduce the time before deprecation to 1 or 2 seconds by setting /proc/sys/net/ipv6/conf/*/regen_min_advance and /proc/sys/net/ipv6/conf/*/dad_transmits to 0. I realize that that is a pretty niche use case, but I know at least one person who would gladly sacrifice performance and convenience to be sure that they are getting the maximum possible level of privacy.
Link: https://serverfault.com/a/1031168/310447 Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
a5fcea2d |
| 14-Feb-2024 |
Alex Henrie <[email protected]> |
net: ipv6/addrconf: introduce a regen_min_advance sysctl
In RFC 8981, REGEN_ADVANCE cannot be less than 2 seconds, and the RFC does not permit the creation of temporary addresses with lifetimes shor
net: ipv6/addrconf: introduce a regen_min_advance sysctl
In RFC 8981, REGEN_ADVANCE cannot be less than 2 seconds, and the RFC does not permit the creation of temporary addresses with lifetimes shorter than that:
> When processing a Router Advertisement with a > Prefix Information option carrying a prefix for the purposes of > address autoconfiguration (i.e., the A bit is set), the host MUST > perform the following steps:
> 5. A temporary address is created only if this calculated preferred > lifetime is greater than REGEN_ADVANCE time units.
However, some users want to change their IPv6 address as frequently as possible regardless of the RFC's arbitrary minimum lifetime. For the benefit of those users, add a regen_min_advance sysctl parameter that can be set to below or above 2 seconds.
Link: https://datatracker.ietf.org/doc/html/rfc8981 Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
2aa8f155 |
| 14-Feb-2024 |
Alex Henrie <[email protected]> |
net: ipv6/addrconf: ensure that regen_advance is at least 2 seconds
RFC 8981 defines REGEN_ADVANCE as follows:
REGEN_ADVANCE = 2 + (TEMP_IDGEN_RETRIES * DupAddrDetectTransmits * RetransTimer / 1000
net: ipv6/addrconf: ensure that regen_advance is at least 2 seconds
RFC 8981 defines REGEN_ADVANCE as follows:
REGEN_ADVANCE = 2 + (TEMP_IDGEN_RETRIES * DupAddrDetectTransmits * RetransTimer / 1000)
Thus, allowing it to be less than 2 seconds is technically a protocol violation.
Link: https://datatracker.ietf.org/doc/html/rfc8981#name-defined-protocol-parameters Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8 |
|
| #
8cdafdd9 |
| 30-Dec-2023 |
Alex Henrie <[email protected]> |
Revert "net: ipv6/addrconf: clamp preferred_lft to the minimum required"
The commit had a bug and might not have been the right approach anyway.
Fixes: 629df6701c8a ("net: ipv6/addrconf: clamp pref
Revert "net: ipv6/addrconf: clamp preferred_lft to the minimum required"
The commit had a bug and might not have been the right approach anyway.
Fixes: 629df6701c8a ("net: ipv6/addrconf: clamp preferred_lft to the minimum required") Fixes: ec575f885e3e ("Documentation: networking: explain what happens if temp_prefered_lft is too small or too large") Reported-by: Dan Moulding <[email protected]> Closes: https://lore.kernel.org/netdev/[email protected]/ Link: https://lore.kernel.org/netdev/CAMMLpeTdYhd=7hhPi2Y7pwdPCgnnW5JYh-bu3hSc7im39uxnEA@mail.gmail.com/ Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6 |
|
| #
ec575f88 |
| 24-Oct-2023 |
Alex Henrie <[email protected]> |
Documentation: networking: explain what happens if temp_prefered_lft is too small or too large
Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link:
Documentation: networking: explain what happens if temp_prefered_lft is too small or too large
Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
433d6c80 |
| 24-Oct-2023 |
Alex Henrie <[email protected]> |
Documentation: networking: explain what happens if temp_valid_lft is too small
Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.ke
Documentation: networking: explain what happens if temp_valid_lft is too small
Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc7, v6.6-rc6 |
|
| #
562b1fdf |
| 11-Oct-2023 |
Haiyang Zhang <[email protected]> |
tcp: Set pingpong threshold via sysctl
TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode fo
tcp: Set pingpong threshold via sysctl
TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance.
The pingpong threshold and related code were changed to 3 in the year 2019 in: commit 4a41f453bedf ("tcp: change pingpong threshold to 3") And reverted to 1 in the year 2022 in: commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")
There is no single value that fits all applications. Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for optimal performance based on the application needs.
Signed-off-by: Haiyang Zhang <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc5, v6.6-rc4 |
|
| #
473267a4 |
| 25-Sep-2023 |
Patrick Rohr <[email protected]> |
net: add sysctl to disable rfc4862 5.5.3e lifetime handling
This change adds a sysctl to opt-out of RFC4862 section 5.5.3e's valid lifetime derivation mechanism.
RFC4862 section 5.5.3e prescribes t
net: add sysctl to disable rfc4862 5.5.3e lifetime handling
This change adds a sysctl to opt-out of RFC4862 section 5.5.3e's valid lifetime derivation mechanism.
RFC4862 section 5.5.3e prescribes that the valid lifetime in a Router Advertisement PIO shall be ignored if it less than 2 hours and to reset the lifetime of the corresponding address to 2 hours. An in-progress 6man draft (see draft-ietf-6man-slaac-renum-07 section 4.2) is currently looking to remove this mechanism. While this draft has not been moving particularly quickly for other reasons, there is widespread consensus on section 4.2 which updates RFC4862 section 5.5.3e.
Cc: Maciej Żenczykowski <[email protected]> Cc: Lorenzo Colitti <[email protected]> Cc: Jen Linkova <[email protected]> Signed-off-by: Patrick Rohr <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc3, v6.6-rc2 |
|
| #
133c4c0d |
| 11-Sep-2023 |
Eric Dumazet <[email protected]> |
tcp: defer regular ACK while processing socket backlog
This idea came after a particular workload requested the quickack attribute set on routes, and a performance drop was noticed for large bulk tr
tcp: defer regular ACK while processing socket backlog
This idea came after a particular workload requested the quickack attribute set on routes, and a performance drop was noticed for large bulk transfers.
For high throughput flows, it is best to use one cpu running the user thread issuing socket system calls, and a separate cpu to process incoming packets from BH context. (With TSO/GRO, bottleneck is usually the 'user' cpu)
Problem is the user thread can spend a lot of time while holding the socket lock, forcing BH handler to queue most of incoming packets in the socket backlog.
Whenever the user thread releases the socket lock, it must first process all accumulated packets in the backlog, potentially adding latency spikes. Due to flood mitigation, having too many packets in the backlog increases chance of unexpected drops.
Backlog processing unfortunately shifts a fair amount of cpu cycles from the BH cpu to the 'user' cpu, thus reducing max throughput.
This patch takes advantage of the backlog processing, and the fact that ACK are mostly cumulative.
The idea is to detect we are in the backlog processing and defer all eligible ACK into a single one, sent from tcp_release_cb().
This saves cpu cycles on both sides, and network resources.
Performance of a single TCP flow on a 200Gbit NIC:
- Throughput is increased by 20% (100Gbit -> 120Gbit). - Number of generated ACK per second shrinks from 240,000 to 40,000. - Number of backlog drops per second shrinks from 230 to 0.
Benchmark context: - Regular netperf TCP_STREAM (no zerocopy) - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids) - MAX_SKB_FRAGS = 17 (~60KB per GRO packet)
This feature is guarded by a new sysctl, and enabled by default: /proc/sys/net/ipv4/tcp_backlog_ack_defer
Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Dave Taht <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4 |
|
| #
5027d54a |
| 26-Jul-2023 |
Patrick Rohr <[email protected]> |
net: change accept_ra_min_rtr_lft to affect all RA lifetimes
accept_ra_min_rtr_lft only considered the lifetime of the default route and discarded entire RAs accordingly.
This change renames accept
net: change accept_ra_min_rtr_lft to affect all RA lifetimes
accept_ra_min_rtr_lft only considered the lifetime of the default route and discarded entire RAs accordingly.
This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and applies the value to individual RA sections; in particular, router lifetime, PIO preferred lifetime, and RIO lifetime. If any of those lifetimes are lower than the configured value, the specific RA section is ignored.
In order for the sysctl to be useful to Android, it should really apply to all lifetimes in the RA, since that is what determines the minimum frequency at which RAs must be processed by the kernel. Android uses hardware offloads to drop RAs for a fraction of the minimum of all lifetimes present in the RA (some networks have very frequent RAs (5s) with high lifetimes (2h)). Despite this, we have encountered networks that set the router lifetime to 30s which results in very frequent CPU wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the WiFi firmware) entirely on such networks, it seems better to ignore the misconfigured routers while still processing RAs from other IPv6 routers on the same network (i.e. to support IoT applications).
The previous implementation dropped the entire RA based on router lifetime. This turned out to be hard to expand to the other lifetimes present in the RA in a consistent manner; dropping the entire RA based on RIO/PIO lifetimes would essentially require parsing the whole thing twice.
Fixes: 1671bcfd76fd ("net: add sysctl accept_ra_min_rtr_lft") Cc: Lorenzo Colitti <[email protected]> Signed-off-by: Patrick Rohr <[email protected]> Reviewed-by: Maciej Żenczykowski <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc3 |
|
| #
1671bcfd |
| 19-Jul-2023 |
Patrick Rohr <[email protected]> |
net: add sysctl accept_ra_min_rtr_lft
This change adds a new sysctl accept_ra_min_rtr_lft to specify the minimum acceptable router lifetime in an RA. If the received RA router lifetime is less than
net: add sysctl accept_ra_min_rtr_lft
This change adds a new sysctl accept_ra_min_rtr_lft to specify the minimum acceptable router lifetime in an RA. If the received RA router lifetime is less than the configured value (and not 0), the RA is ignored. This is useful for mobile devices, whose battery life can be impacted by networks that configure RAs with a short lifetime. On such networks, the device should never gain IPv6 provisioning and should attempt to drop RAs via hardware offload, if available.
Signed-off-by: Patrick Rohr <[email protected]> Cc: Maciej Żenczykowski <[email protected]> Cc: Lorenzo Colitti <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
dfa2f048 |
| 17-Jul-2023 |
Eric Dumazet <[email protected]> |
tcp: get rid of sysctl_tcp_adv_win_scale
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to
tcp: get rid of sysctl_tcp_adv_win_scale
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path.
tcp_win_from_space() implementation is limited to few cases.
For hosts dealing with various MSS, we either under estimate or over estimate the RWIN we send to the remote peers.
For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory.
For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes.
This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio.
This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE.
Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7 |
|
| #
b650d953 |
| 12-Jun-2023 |
[email protected] <[email protected]> |
tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to inco
tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP.
To reproduce: Connect a TCP session with the receiver doing nothing and the sender sending small packets (an infinite loop of socket send() with 4 bytes of payload with a sleep of 1 ms in between each send()). This will cause the tcp receive buffer to grow all the way up to tcp_rmem[2].
As a result, a host can have individual tcp sessions with receive buffers of size tcp_rmem[2], and the host itself can reach tcp_mem limits, causing the host to go into tcp memory pressure mode.
The fundamental issue is the relationship between the granularity of the window scaling factor and the number of byte ACKed back to the sender. This problem has previously been identified in RFC 7323, appendix F [1].
The Linux kernel currently adheres to never shrinking the window.
In addition to the overallocation of memory mentioned above, the current behavior is functionally incorrect, because once tcp_rmem[2] is reached when no remediations remain (i.e. tcp collapse fails to free up any more memory and there are no packets to prune from the out-of-order queue), the receiver will drop in-window packets resulting in retransmissions and an eventual timeout of the tcp session. A receive buffer full condition should instead result in a zero window and an indefinite wait.
In practice, this problem is largely hidden for most flows. It is not applicable to mice flows. Elephant flows can send data fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), triggering a zero window.
But this problem does show up for other types of flows. Examples are websockets and other type of flows that send small amounts of data spaced apart slightly in time. In these cases, we directly encounter the problem described in [1].
RFC 7323, section 2.4 [2], says there are instances when a retracted window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122, section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window management have made clear that sender must accept a shrunk window from the receiver, including RFC 793 [4] and RFC 1323 [5].
This patch implements the functionality to shrink the tcp window when necessary to keep the right edge within the memory limit by autotuning (sk_rcvbuf). This new functionality is enabled with the new sysctl: net.ipv4.tcp_shrink_window
Additional information can be found at: https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4 [3] https://www.rfc-editor.org/rfc/rfc1122#page-91 [4] https://www.rfc-editor.org/rfc/rfc793 [5] https://www.rfc-editor.org/rfc/rfc1323
Signed-off-by: Mike Freemon <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc6, v6.4-rc5 |
|
| #
e209fee4 |
| 01-Jun-2023 |
Akihiro Suda <[email protected]> |
net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294
With this commit, all the GIDs ("0 4294967294") can be written to the "net.ipv4.ping_group_range" sysctl.
Note that 4294967295 (0
net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294
With this commit, all the GIDs ("0 4294967294") can be written to the "net.ipv4.ping_group_range" sysctl.
Note that 4294967295 (0xffffffff) is an invalid GID (see gid_valid() in include/linux/uidgid.h), and an attempt to register this number will cause -EINVAL.
Prior to this commit, only up to GID 2147483647 could be covered. Documentation/networking/ip-sysctl.rst had "0 4294967295" as an example value, but this example was wrong and causing -EINVAL.
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Co-developed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Akihiro Suda <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc4, v6.4-rc3, v6.4-rc2 |
|
| #
ccce324d |
| 09-May-2023 |
David Morley <[email protected]> |
tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff scheme, which can be unnecessarily conservative in cases where there are link failures. In
tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff scheme, which can be unnecessarily conservative in cases where there are link failures. In such cases, it's better to aggressively try to retransmit packets, so it takes routers less time to find a repath with a working link.
We chose a default value for this sysctl of 4, to follow the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ... MacOS and IOS have used this backoff schedule for over a decade, since before this 2009 IETF presentation discussed the behavior: https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
This commit makes the SYN RTO schedule start with a number of linear backoffs given by the following sysctl: * tcp_syn_linear_timeouts
This changes the SYN RTO scheme to be: init_rto_val for tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
Signed-off-by: David Morley <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Tested-by: David Morley <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc1, v6.3 |
|
| #
7ab75456 |
| 19-Apr-2023 |
Mahesh Bandewar <[email protected]> |
ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6
ICMPv6 error packets are not sent to the anycast destinations and this prevents things like traceroute from working. So create a setting similar
ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6
ICMPv6 error packets are not sent to the anycast destinations and this prevents things like traceroute from working. So create a setting similar to ECHO when dealing with Anycast sources (icmpv6_echo_ignore_anycast).
Signed-off-by: Mahesh Bandewar <[email protected]> Reviewed-by: David Ahern <[email protected]> Reviewed-by: Maciej Żenczykowski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc7, v6.3-rc6 |
|
| #
dc5110c2 |
| 06-Apr-2023 |
YueHaibing <[email protected]> |
tcp: restrict net.ipv4.tcp_app_win
UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23 shift exponent 255 is too large for 32-bit type 'int' CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00
tcp: restrict net.ipv4.tcp_app_win
UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23 shift exponent 255 is too large for 32-bit type 'int' CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00161-g62bad54b26db-dirty #206 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x136/0x150 __ubsan_handle_shift_out_of_bounds+0x21f/0x5a0 tcp_init_transfer.cold+0x3a/0xb9 tcp_finish_connect+0x1d0/0x620 tcp_rcv_state_process+0xd78/0x4d60 tcp_v4_do_rcv+0x33d/0x9d0 __release_sock+0x133/0x3b0 release_sock+0x58/0x1b0
'maxwin' is int, shifting int for 32 or more bits is undefined behaviour.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: YueHaibing <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7 |
|
| #
62e395f8 |
| 30-Jan-2023 |
Brian Haley <[email protected]> |
neighbor: fix proxy_delay usage when it is zero
When set to zero, the neighbor sysctl proxy_delay value does not cause an immediate reply for ARP/ND requests as expected, it instead causes a random
neighbor: fix proxy_delay usage when it is zero
When set to zero, the neighbor sysctl proxy_delay value does not cause an immediate reply for ARP/ND requests as expected, it instead causes a random delay between [0, U32_MAX). Looking at this comment from __get_random_u32_below() explains the reason:
/* * This function is technically undefined for ceil == 0, and in fact * for the non-underscored constant version in the header, we build bug * on that. But for the non-constant case, it's convenient to have that * evaluate to being a straight call to get_random_u32(), so that * get_random_u32_inclusive() can work over its whole range without * undefined behavior. */
Added helper function that does not call get_random_u32_below() if proxy_delay is zero and just uses the current value of jiffies instead, causing pneigh_enqueue() to respond immediately.
Also added definition of proxy_delay to ip-sysctl.txt since it was missing.
Signed-off-by: Brian Haley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|