ip-sysctl.rst - OpenGrok history log for /linux-6.15/Documentation/networking/ip-sysctl.rst

Revision (<<< Hide revision tags) (Show revision tags >>>)	Date	Author	Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
# f38805c5	17-Mar-2025	Jason Xing <[email protected]>	tcp: support TCP_RTO_MIN_US for set/getsockopt use Support adjusting/reading RTO MIN for socket level by using set/getsockopt(). This new option has the same effect as TCP_BPF_RTO_MIN, which means tcp: support TCP_RTO_MIN_US for set/getsockopt use Support adjusting/reading RTO MIN for socket level by using set/getsockopt(). This new option has the same effect as TCP_BPF_RTO_MIN, which means it doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. When the socket is created, its icsk_rto_min is set to the default value that is controlled by sysctl_tcp_rto_min_us. Then if application calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then icsk_rto_min will be overridden in jiffies unit. This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around icsk_rto_min. Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
# 1280c262	07-Feb-2025	Eric Dumazet <[email protected]>	tcp: add tcp_rto_max_ms sysctl Previous patch added a TCP_RTO_MAX_MS socket option to tune a TCP socket max RTO value. Many setups prefer to change a per netns sysctl. This patch adds /proc/sys/ne tcp: add tcp_rto_max_ms sysctl Previous patch added a TCP_RTO_MAX_MS socket option to tune a TCP socket max RTO value. Many setups prefer to change a per netns sysctl. This patch adds /proc/sys/net/ipv4/tcp_rto_max_ms Its initial value is 120000 (120 seconds). Keep in mind that a decrease of tcp_rto_max_ms means shorter overall timeouts, unless tcp_retries2 sysctl is increased. Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> show more ...
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3
# ca6a6f93	09-Dec-2024	Jakub Sitnicki <[email protected]>	tcp: Add sysctl to configure TIME-WAIT reuse delay Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be reused by reopening a connection. This is a safe choice based on an assum tcp: Add sysctl to configure TIME-WAIT reuse delay Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be reused by reopening a connection. This is a safe choice based on an assumption that the other TCP timestamp clock frequency, which is unknown to us, may be as low as 1 Hz (RFC 7323, section 5.4). However, this means that in the presence of short lived connections with an RTT of couple of milliseconds, the time during which a 4-tuple is blocked from reuse can be orders of magnitude longer that the connection lifetime. Combined with a reduced pool of ephemeral ports, when using IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the long TIME-WAIT reuse delay can lead to port exhaustion, where all available 4-tuples are tied up in TIME-WAIT state. Turn the reuse delay into a per-netns setting so that sysadmins can make more aggressive assumptions about remote TCP timestamp clock frequency and shorten the delay in order to allow connections to reincarnate faster. Note that applications can completely bypass the TIME-WAIT delay protection already today by locking the local port with bind() before connecting. Such immediate connection reuse may result in PAWS failing to detect old duplicate segments, leaving us with just the sequence number check as a safety net. This new configurable offers a trade off where the sysadmin can balance between the risk of PAWS detection failing to act versus exhausting ports by having sockets tied up in TIME-WAIT state for too long. [1] https://lpc.events/event/16/contributions/1349/ Signed-off-by: Jakub Sitnicki <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-2-66aca0eed03e@cloudflare.com Signed-off-by: Jakub Kicinski <[email protected]> show more ...
# bbe4b412	09-Dec-2024	Petr Machata <[email protected]>	Documentation: networking: Add a caveat to nexthop_compat_mode sysctl net.ipv4.nexthop_compat_mode was added when nexthop objects were added to provide the view of nexthop objects through the usual Documentation: networking: Add a caveat to nexthop_compat_mode sysctl net.ipv4.nexthop_compat_mode was added when nexthop objects were added to provide the view of nexthop objects through the usual lens of the route UAPI. As nexthop objects evolved, the information provided through this lens became incomplete. For example, details of resilient nexthop groups are obviously omitted. Now that 16-bit nexthop group weights are a thing, the 8-bit UAPI cannot convey the >8-bit weight accurately. Instead of inventing workarounds for an obsolete interface, just document the expectations of inaccuracy. Fixes: b72a6a7ab957 ("net: nexthop: Increase weight to u16") Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/b575e32399ccacd09079b2a218255164535123bd.1733740749.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2
# 990c3049	29-Jul-2024	Patrick Rohr <[email protected]>	Add support for PIO p flag draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information Option to signal that the network can allocate a unique IPv6 prefix per client via DHCPv6-PD (see Add support for PIO p flag draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information Option to signal that the network can allocate a unique IPv6 prefix per client via DHCPv6-PD (see draft-ietf-v6ops-dhcp-pd-per-device). When ra_honor_pio_pflag is enabled, the presence of a P-flag causes SLAAC autoconfiguration to be disabled for that particular PIO. An automated test has been added in Android (r.android.com/3195335) to go along with this change. Cc: Maciej Żenczykowski <[email protected]> Cc: Lorenzo Colitti <[email protected]> Cc: David Lamparter <[email protected]> Cc: Simon Horman <[email protected]> Signed-off-by: Patrick Rohr <[email protected]> Reviewed-by: Maciej Żenczykowski <[email protected]> Signed-off-by: David S. Miller <[email protected]> show more ...
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3
# 4ee2a8ca	07-Jun-2024	Petr Machata <[email protected]>	net: ipv4: Add a sysctl to set multipath hash seed When calculating hashes for the purpose of multipath forwarding, both IPv4 and IPv6 code currently fall back on flow_hash_from_keys(). That uses a net: ipv4: Add a sysctl to set multipath hash seed When calculating hashes for the purpose of multipath forwarding, both IPv4 and IPv6 code currently fall back on flow_hash_from_keys(). That uses a randomly-generated seed. That's a fine choice by default, but unfortunately some deployments may need a tighter control over the seed used. In this patch, make the seed configurable by adding a new sysctl key, net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used specifically for multipath forwarding and not for the other concerns that flow_hash_from_keys() is used for, such as queue selection. Expose the knob as sysctl because other such settings, such as headers to hash, are also handled that way. Like those, the multipath hash seed is a per-netns variable. Despite being placed in the net.ipv4 namespace, the multipath seed sysctl is used for both IPv4 and IPv6, similarly to e.g. a number of TCP variables. The seed used by flow_hash_from_keys() is a 128-bit quantity. However it seems that usually the seed is a much more modest value. 32 bits seem typical (Cisco, Cumulus), some systems go even lower. For that reason, and to decouple the user interface from implementation details, go with a 32-bit quantity, which is then quadruplicated to form the siphash key. Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
# f086edef	03-Jun-2024	Kevin Yang <[email protected]>	tcp: add sysctl_tcp_rto_min_us Adding a sysctl knob to allow user to specify a default rto_min at socket init time, other than using the hard coded 200ms default rto_min. Note that the rto_min rout tcp: add sysctl_tcp_rto_min_us Adding a sysctl knob to allow user to specify a default rto_min at socket init time, other than using the hard coded 200ms default rto_min. Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN socket option, followed by the tcp_rto_min_us sysctl. Signed-off-by: Kevin Yang <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Reviewed-by: Yuchung Cheng <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Tony Lu <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]> show more ...
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5
# f4bcbf36	14-Feb-2024	Alex Henrie <[email protected]>	net: ipv6/addrconf: clamp preferred_lft to the minimum required If the preferred lifetime was less than the minimum required lifetime, ipv6_create_tempaddr would error out without creating any new a net: ipv6/addrconf: clamp preferred_lft to the minimum required If the preferred lifetime was less than the minimum required lifetime, ipv6_create_tempaddr would error out without creating any new address. On my machine and network, this error happened immediately with the preferred lifetime set to 5 seconds or less, after a few minutes with the preferred lifetime set to 6 seconds, and not at all with the preferred lifetime set to 7 seconds. During my investigation, I found a Stack Exchange post from another person who seems to have had the same problem: They stopped getting new addresses if they lowered the preferred lifetime below 3 seconds, and they didn't really know why. The preferred lifetime is a preference, not a hard requirement. The kernel does not strictly forbid new connections on a deprecated address, nor does it guarantee that the address will be disposed of the instant its total valid lifetime expires. So rather than disable IPv6 privacy extensions altogether if the minimum required lifetime swells above the preferred lifetime, it is more in keeping with the user's intent to increase the temporary address's lifetime to the minimum necessary for the current network conditions. With these fixes, setting the preferred lifetime to 5 or 6 seconds "just works" because the extra fraction of a second is practically unnoticeable. It's even possible to reduce the time before deprecation to 1 or 2 seconds by setting /proc/sys/net/ipv6/conf//regen_min_advance and /proc/sys/net/ipv6/conf//dad_transmits to 0. I realize that that is a pretty niche use case, but I know at least one person who would gladly sacrifice performance and convenience to be sure that they are getting the maximum possible level of privacy. Link: https://serverfault.com/a/1031168/310447 Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> show more ...
# a5fcea2d	14-Feb-2024	Alex Henrie <[email protected]>	net: ipv6/addrconf: introduce a regen_min_advance sysctl In RFC 8981, REGEN_ADVANCE cannot be less than 2 seconds, and the RFC does not permit the creation of temporary addresses with lifetimes shor net: ipv6/addrconf: introduce a regen_min_advance sysctl In RFC 8981, REGEN_ADVANCE cannot be less than 2 seconds, and the RFC does not permit the creation of temporary addresses with lifetimes shorter than that: > When processing a Router Advertisement with a > Prefix Information option carrying a prefix for the purposes of > address autoconfiguration (i.e., the A bit is set), the host MUST > perform the following steps: > 5. A temporary address is created only if this calculated preferred > lifetime is greater than REGEN_ADVANCE time units. However, some users want to change their IPv6 address as frequently as possible regardless of the RFC's arbitrary minimum lifetime. For the benefit of those users, add a regen_min_advance sysctl parameter that can be set to below or above 2 seconds. Link: https://datatracker.ietf.org/doc/html/rfc8981 Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> show more ...
# 2aa8f155	14-Feb-2024	Alex Henrie <[email protected]>	net: ipv6/addrconf: ensure that regen_advance is at least 2 seconds RFC 8981 defines REGEN_ADVANCE as follows: REGEN_ADVANCE = 2 + (TEMP_IDGEN_RETRIES * DupAddrDetectTransmits * RetransTimer / 1000 net: ipv6/addrconf: ensure that regen_advance is at least 2 seconds RFC 8981 defines REGEN_ADVANCE as follows: REGEN_ADVANCE = 2 + (TEMP_IDGEN_RETRIES * DupAddrDetectTransmits * RetransTimer / 1000) Thus, allowing it to be less than 2 seconds is technically a protocol violation. Link: https://datatracker.ietf.org/doc/html/rfc8981#name-defined-protocol-parameters Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> show more ...
Revision tags: v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8
# 8cdafdd9	30-Dec-2023	Alex Henrie <[email protected]>	Revert "net: ipv6/addrconf: clamp preferred_lft to the minimum required" The commit had a bug and might not have been the right approach anyway. Fixes: 629df6701c8a ("net: ipv6/addrconf: clamp pref Revert "net: ipv6/addrconf: clamp preferred_lft to the minimum required" The commit had a bug and might not have been the right approach anyway. Fixes: 629df6701c8a ("net: ipv6/addrconf: clamp preferred_lft to the minimum required") Fixes: ec575f885e3e ("Documentation: networking: explain what happens if temp_prefered_lft is too small or too large") Reported-by: Dan Moulding <[email protected]> Closes: https://lore.kernel.org/netdev/[email protected]/ Link: https://lore.kernel.org/netdev/CAMMLpeTdYhd=7hhPi2Y7pwdPCgnnW5JYh-bu3hSc7im39uxnEA@mail.gmail.com/ Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6
# ec575f88	24-Oct-2023	Alex Henrie <[email protected]>	Documentation: networking: explain what happens if temp_prefered_lft is too small or too large Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: Documentation: networking: explain what happens if temp_prefered_lft is too small or too large Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
# 433d6c80	24-Oct-2023	Alex Henrie <[email protected]>	Documentation: networking: explain what happens if temp_valid_lft is too small Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.ke Documentation: networking: explain what happens if temp_valid_lft is too small Signed-off-by: Alex Henrie <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.6-rc7, v6.6-rc6
# 562b1fdf	11-Oct-2023	Haiyang Zhang <[email protected]>	tcp: Set pingpong threshold via sysctl TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode fo tcp: Set pingpong threshold via sysctl TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance. The pingpong threshold and related code were changed to 3 in the year 2019 in: commit 4a41f453bedf ("tcp: change pingpong threshold to 3") And reverted to 1 in the year 2022 in: commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"") There is no single value that fits all applications. Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for optimal performance based on the application needs. Signed-off-by: Haiyang Zhang <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.6-rc5, v6.6-rc4
# 473267a4	25-Sep-2023	Patrick Rohr <[email protected]>	net: add sysctl to disable rfc4862 5.5.3e lifetime handling This change adds a sysctl to opt-out of RFC4862 section 5.5.3e's valid lifetime derivation mechanism. RFC4862 section 5.5.3e prescribes t net: add sysctl to disable rfc4862 5.5.3e lifetime handling This change adds a sysctl to opt-out of RFC4862 section 5.5.3e's valid lifetime derivation mechanism. RFC4862 section 5.5.3e prescribes that the valid lifetime in a Router Advertisement PIO shall be ignored if it less than 2 hours and to reset the lifetime of the corresponding address to 2 hours. An in-progress 6man draft (see draft-ietf-6man-slaac-renum-07 section 4.2) is currently looking to remove this mechanism. While this draft has not been moving particularly quickly for other reasons, there is widespread consensus on section 4.2 which updates RFC4862 section 5.5.3e. Cc: Maciej Żenczykowski <[email protected]> Cc: Lorenzo Colitti <[email protected]> Cc: Jen Linkova <[email protected]> Signed-off-by: Patrick Rohr <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.6-rc3, v6.6-rc2
# 133c4c0d	11-Sep-2023	Eric Dumazet <[email protected]>	tcp: defer regular ACK while processing socket backlog This idea came after a particular workload requested the quickack attribute set on routes, and a performance drop was noticed for large bulk tr tcp: defer regular ACK while processing socket backlog This idea came after a particular workload requested the quickack attribute set on routes, and a performance drop was noticed for large bulk transfers. For high throughput flows, it is best to use one cpu running the user thread issuing socket system calls, and a separate cpu to process incoming packets from BH context. (With TSO/GRO, bottleneck is usually the 'user' cpu) Problem is the user thread can spend a lot of time while holding the socket lock, forcing BH handler to queue most of incoming packets in the socket backlog. Whenever the user thread releases the socket lock, it must first process all accumulated packets in the backlog, potentially adding latency spikes. Due to flood mitigation, having too many packets in the backlog increases chance of unexpected drops. Backlog processing unfortunately shifts a fair amount of cpu cycles from the BH cpu to the 'user' cpu, thus reducing max throughput. This patch takes advantage of the backlog processing, and the fact that ACK are mostly cumulative. The idea is to detect we are in the backlog processing and defer all eligible ACK into a single one, sent from tcp_release_cb(). This saves cpu cycles on both sides, and network resources. Performance of a single TCP flow on a 200Gbit NIC: - Throughput is increased by 20% (100Gbit -> 120Gbit). - Number of generated ACK per second shrinks from 240,000 to 40,000. - Number of backlog drops per second shrinks from 230 to 0. Benchmark context: - Regular netperf TCP_STREAM (no zerocopy) - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids) - MAX_SKB_FRAGS = 17 (~60KB per GRO packet) This feature is guarded by a new sysctl, and enabled by default: /proc/sys/net/ipv4/tcp_backlog_ack_defer Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Dave Taht <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> show more ...
Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4
# 5027d54a	26-Jul-2023	Patrick Rohr <[email protected]>	net: change accept_ra_min_rtr_lft to affect all RA lifetimes accept_ra_min_rtr_lft only considered the lifetime of the default route and discarded entire RAs accordingly. This change renames accept net: change accept_ra_min_rtr_lft to affect all RA lifetimes accept_ra_min_rtr_lft only considered the lifetime of the default route and discarded entire RAs accordingly. This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and applies the value to individual RA sections; in particular, router lifetime, PIO preferred lifetime, and RIO lifetime. If any of those lifetimes are lower than the configured value, the specific RA section is ignored. In order for the sysctl to be useful to Android, it should really apply to all lifetimes in the RA, since that is what determines the minimum frequency at which RAs must be processed by the kernel. Android uses hardware offloads to drop RAs for a fraction of the minimum of all lifetimes present in the RA (some networks have very frequent RAs (5s) with high lifetimes (2h)). Despite this, we have encountered networks that set the router lifetime to 30s which results in very frequent CPU wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the WiFi firmware) entirely on such networks, it seems better to ignore the misconfigured routers while still processing RAs from other IPv6 routers on the same network (i.e. to support IoT applications). The previous implementation dropped the entire RA based on router lifetime. This turned out to be hard to expand to the other lifetimes present in the RA in a consistent manner; dropping the entire RA based on RIO/PIO lifetimes would essentially require parsing the whole thing twice. Fixes: 1671bcfd76fd ("net: add sysctl accept_ra_min_rtr_lft") Cc: Lorenzo Colitti <[email protected]> Signed-off-by: Patrick Rohr <[email protected]> Reviewed-by: Maciej Żenczykowski <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.5-rc3
# 1671bcfd	19-Jul-2023	Patrick Rohr <[email protected]>	net: add sysctl accept_ra_min_rtr_lft This change adds a new sysctl accept_ra_min_rtr_lft to specify the minimum acceptable router lifetime in an RA. If the received RA router lifetime is less than net: add sysctl accept_ra_min_rtr_lft This change adds a new sysctl accept_ra_min_rtr_lft to specify the minimum acceptable router lifetime in an RA. If the received RA router lifetime is less than the configured value (and not 0), the RA is ignored. This is useful for mobile devices, whose battery life can be impacted by networks that configure RAs with a short lifetime. On such networks, the device should never gain IPv6 provisioning and should attempt to drop RAs via hardware offload, if available. Signed-off-by: Patrick Rohr <[email protected]> Cc: Maciej Żenczykowski <[email protected]> Cc: Lorenzo Colitti <[email protected]> Signed-off-by: David S. Miller <[email protected]> show more ...
# dfa2f048	17-Jul-2023	Eric Dumazet <[email protected]>	tcp: get rid of sysctl_tcp_adv_win_scale With modern NIC drivers shifting to full page allocations per received frame, we face the following issue: TCP has one per-netns sysctl used to tweak how to tcp: get rid of sysctl_tcp_adv_win_scale With modern NIC drivers shifting to full page allocations per received frame, we face the following issue: TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path. tcp_win_from_space() implementation is limited to few cases. For hosts dealing with various MSS, we either under estimate or over estimate the RWIN we send to the remote peers. For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory. For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes. This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio. This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7
# b650d953	12-Jun-2023	[email protected] <[email protected]>	tcp: enforce receive buffer memory limits by allowing the tcp window to shrink Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to inco tcp: enforce receive buffer memory limits by allowing the tcp window to shrink Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP. To reproduce: Connect a TCP session with the receiver doing nothing and the sender sending small packets (an infinite loop of socket send() with 4 bytes of payload with a sleep of 1 ms in between each send()). This will cause the tcp receive buffer to grow all the way up to tcp_rmem[2]. As a result, a host can have individual tcp sessions with receive buffers of size tcp_rmem[2], and the host itself can reach tcp_mem limits, causing the host to go into tcp memory pressure mode. The fundamental issue is the relationship between the granularity of the window scaling factor and the number of byte ACKed back to the sender. This problem has previously been identified in RFC 7323, appendix F [1]. The Linux kernel currently adheres to never shrinking the window. In addition to the overallocation of memory mentioned above, the current behavior is functionally incorrect, because once tcp_rmem[2] is reached when no remediations remain (i.e. tcp collapse fails to free up any more memory and there are no packets to prune from the out-of-order queue), the receiver will drop in-window packets resulting in retransmissions and an eventual timeout of the tcp session. A receive buffer full condition should instead result in a zero window and an indefinite wait. In practice, this problem is largely hidden for most flows. It is not applicable to mice flows. Elephant flows can send data fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), triggering a zero window. But this problem does show up for other types of flows. Examples are websockets and other type of flows that send small amounts of data spaced apart slightly in time. In these cases, we directly encounter the problem described in [1]. RFC 7323, section 2.4 [2], says there are instances when a retracted window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122, section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window management have made clear that sender must accept a shrunk window from the receiver, including RFC 793 [4] and RFC 1323 [5]. This patch implements the functionality to shrink the tcp window when necessary to keep the right edge within the memory limit by autotuning (sk_rcvbuf). This new functionality is enabled with the new sysctl: net.ipv4.tcp_shrink_window Additional information can be found at: https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/ [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4 [3] https://www.rfc-editor.org/rfc/rfc1122#page-91 [4] https://www.rfc-editor.org/rfc/rfc793 [5] https://www.rfc-editor.org/rfc/rfc1323 Signed-off-by: Mike Freemon <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]> show more ...
Revision tags: v6.4-rc6, v6.4-rc5
# e209fee4	01-Jun-2023	Akihiro Suda <[email protected]>	net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294 With this commit, all the GIDs ("0 4294967294") can be written to the "net.ipv4.ping_group_range" sysctl. Note that 4294967295 (0 net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294 With this commit, all the GIDs ("0 4294967294") can be written to the "net.ipv4.ping_group_range" sysctl. Note that 4294967295 (0xffffffff) is an invalid GID (see gid_valid() in include/linux/uidgid.h), and an attempt to register this number will cause -EINVAL. Prior to this commit, only up to GID 2147483647 could be covered. Documentation/networking/ip-sysctl.rst had "0 4294967295" as an example value, but this example was wrong and causing -EINVAL. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Co-developed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Akihiro Suda <[email protected]> Signed-off-by: David S. Miller <[email protected]> show more ...
Revision tags: v6.4-rc4, v6.4-rc3, v6.4-rc2
# ccce324d	09-May-2023	David Morley <[email protected]>	tcp: make the first N SYN RTO backoffs linear Currently the SYN RTO schedule follows an exponential backoff scheme, which can be unnecessarily conservative in cases where there are link failures. In tcp: make the first N SYN RTO backoffs linear Currently the SYN RTO schedule follows an exponential backoff scheme, which can be unnecessarily conservative in cases where there are link failures. In such cases, it's better to aggressively try to retransmit packets, so it takes routers less time to find a repath with a working link. We chose a default value for this sysctl of 4, to follow the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ... MacOS and IOS have used this backoff schedule for over a decade, since before this 2009 IETF presentation discussed the behavior: https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf This commit makes the SYN RTO schedule start with a number of linear backoffs given by the following sysctl: * tcp_syn_linear_timeouts This changes the SYN RTO scheme to be: init_rto_val for tcp_syn_linear_timeouts, exp backoff starting at init_rto_val For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ... Signed-off-by: David Morley <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Tested-by: David Morley <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]> show more ...
Revision tags: v6.4-rc1, v6.3
# 7ab75456	19-Apr-2023	Mahesh Bandewar <[email protected]>	ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6 ICMPv6 error packets are not sent to the anycast destinations and this prevents things like traceroute from working. So create a setting similar ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6 ICMPv6 error packets are not sent to the anycast destinations and this prevents things like traceroute from working. So create a setting similar to ECHO when dealing with Anycast sources (icmpv6_echo_ignore_anycast). Signed-off-by: Mahesh Bandewar <[email protected]> Reviewed-by: David Ahern <[email protected]> Reviewed-by: Maciej Żenczykowski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.3-rc7, v6.3-rc6
# dc5110c2	06-Apr-2023	YueHaibing <[email protected]>	tcp: restrict net.ipv4.tcp_app_win UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23 shift exponent 255 is too large for 32-bit type 'int' CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00 tcp: restrict net.ipv4.tcp_app_win UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23 shift exponent 255 is too large for 32-bit type 'int' CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00161-g62bad54b26db-dirty #206 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x136/0x150 __ubsan_handle_shift_out_of_bounds+0x21f/0x5a0 tcp_init_transfer.cold+0x3a/0xb9 tcp_finish_connect+0x1d0/0x620 tcp_rcv_state_process+0xd78/0x4d60 tcp_v4_do_rcv+0x33d/0x9d0 __release_sock+0x133/0x3b0 release_sock+0x58/0x1b0 'maxwin' is int, shifting int for 32 or more bits is undefined behaviour. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: YueHaibing <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]> show more ...
Revision tags: v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7
# 62e395f8	30-Jan-2023	Brian Haley <[email protected]>	neighbor: fix proxy_delay usage when it is zero When set to zero, the neighbor sysctl proxy_delay value does not cause an immediate reply for ARP/ND requests as expected, it instead causes a random neighbor: fix proxy_delay usage when it is zero When set to zero, the neighbor sysctl proxy_delay value does not cause an immediate reply for ARP/ND requests as expected, it instead causes a random delay between [0, U32_MAX). Looking at this comment from __get_random_u32_below() explains the reason: /* * This function is technically undefined for ceil == 0, and in fact * for the non-underscored constant version in the header, we build bug * on that. But for the non-constant case, it's convenient to have that * evaluate to being a straight call to get_random_u32(), so that * get_random_u32_inclusive() can work over its whole range without * undefined behavior. */ Added helper function that does not call get_random_u32_below() if proxy_delay is zero and just uses the current value of jiffies instead, causing pneigh_enqueue() to respond immediately. Also added definition of proxy_delay to ip-sysctl.txt since it was missing. Signed-off-by: Brian Haley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> show more ...
12 3 4