|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
8965c160 |
| 01-Apr-2025 |
Stanislav Fomichev <[email protected]> |
net: use netif_disable_lro in ipv6_add_dev
ipv6_add_dev might call dev_disable_lro which unconditionally grabs instance lock, so it will deadlock during NETDEV_REGISTER. Switch to netif_disable_lro.
net: use netif_disable_lro in ipv6_add_dev
ipv6_add_dev might call dev_disable_lro which unconditionally grabs instance lock, so it will deadlock during NETDEV_REGISTER. Switch to netif_disable_lro.
Make sure all callers hold the instance lock as well.
Cc: Cosmin Ratiu <[email protected]> Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
d4438ce6 |
| 05-Mar-2025 |
Eric Dumazet <[email protected]> |
inet: call inet6_ehashfn() once from inet6_hash_connect()
inet6_ehashfn() being called from __inet6_check_established() has a big impact on performance, as shown in the Tested section.
After prior
inet: call inet6_ehashfn() once from inet6_hash_connect()
inet6_ehashfn() being called from __inet6_check_established() has a big impact on performance, as shown in the Tested section.
After prior patch, we can compute the hash for port 0 from inet6_hash_connect(), and derive each hash in __inet_hash_connect() from this initial hash:
hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport
Apply the same principle for __inet_check_established(), although inet_ehashfn() has a smaller cost.
Tested:
Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
Before this patch:
utime_start=0.286131 utime_end=4.378886 stime_start=11.952556 stime_end=1991.655533 num_transactions=1446830 latency_min=0.001061085 latency_max=12.075275028 latency_mean=0.376375302 latency_stddev=1.361969596 num_samples=306383 throughput=151866.56
perf top:
50.01% [kernel] [k] __inet6_check_established 20.65% [kernel] [k] __inet_hash_connect 15.81% [kernel] [k] inet6_ehashfn 2.92% [kernel] [k] rcu_all_qs 2.34% [kernel] [k] __cond_resched 0.50% [kernel] [k] _raw_spin_lock 0.34% [kernel] [k] sched_balance_trigger 0.24% [kernel] [k] queued_spin_lock_slowpath
After this patch:
utime_start=0.315047 utime_end=9.257617 stime_start=7.041489 stime_end=1923.688387 num_transactions=3057968 latency_min=0.003041375 latency_max=7.056589232 latency_mean=0.141075048 # Better latency metrics latency_stddev=0.526900516 num_samples=312996 throughput=320677.21 # 111 % increase, and 229 % for the series
perf top: inet6_ehashfn is no longer seen.
39.67% [kernel] [k] __inet_hash_connect 37.06% [kernel] [k] __inet6_check_established 4.79% [kernel] [k] rcu_all_qs 3.82% [kernel] [k] __cond_resched 1.76% [kernel] [k] sched_balance_domains 0.82% [kernel] [k] _raw_spin_lock 0.81% [kernel] [k] sched_balance_rq 0.81% [kernel] [k] sched_balance_trigger 0.76% [kernel] [k] queued_spin_lock_slowpath
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Tested-by: Jason Xing <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
9329b583 |
| 14-Feb-2025 |
Willem de Bruijn <[email protected]> |
ipv4: remove get_rttos
Initialize the ip cookie tos field when initializing the cookie, in ipcm_init_sk.
The existing code inverts the standard pattern for initializing cookie fields. Default is to
ipv4: remove get_rttos
Initialize the ip cookie tos field when initializing the cookie, in ipcm_init_sk.
The existing code inverts the standard pattern for initializing cookie fields. Default is to initialize the field from the sk, then possibly overwrite that when parsing cmsgs (the unlikely case).
This field inverts that, setting the field to an illegal value and after cmsg parsing checking whether the value is still illegal and thus should be overridden.
Be careful to always apply mask INET_DSCP_MASK, as before.
Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
94788792 |
| 14-Feb-2025 |
Willem de Bruijn <[email protected]> |
ipv4: initialize inet socket cookies with sockcm_init
Avoid open coding the same logic.
Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: http
ipv4: initialize inet socket cookies with sockcm_init
Avoid open coding the same logic.
Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
54568a84 |
| 12-Feb-2025 |
Eric Dumazet <[email protected]> |
net: introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL()
We have many EXPORT_SYMBOL(x) in networking tree because IPv6 can be built as a module.
CONFIG_IPV6=y is becoming the norm.
Define a EXPO
net: introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL()
We have many EXPORT_SYMBOL(x) in networking tree because IPv6 can be built as a module.
CONFIG_IPV6=y is becoming the norm.
Define a EXPORT_IPV6_MOD(x) which only exports x for modular IPv6.
Same principle applies to EXPORT_IPV6_MOD_GPL()
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Mateusz Polchlopek <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2 |
|
| #
071d8012 |
| 05-Feb-2025 |
Eric Dumazet <[email protected]> |
ipv4: use RCU protection in ip_dst_mtu_maybe_forward()
ip_dst_mtu_maybe_forward() must use RCU protection to make sure the net structure it reads does not disappear.
Fixes: f87c10a8aa1e8 ("ipv4: in
ipv4: use RCU protection in ip_dst_mtu_maybe_forward()
ip_dst_mtu_maybe_forward() must use RCU protection to make sure the net structure it reads does not disappear.
Fixes: f87c10a8aa1e8 ("ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing") Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
a32f3e9d |
| 13-Dec-2024 |
Anna Emese Nyiri <[email protected]> |
sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the socket level, applying a uniform priority to all packets sent through that socket. The exception to th
sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the socket level, applying a uniform priority to all packets sent through that socket. The exception to this is IP_TOS, when the priority value is calculated during the handling of ancillary data, as implemented in commit f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data"). However, this is a computed value, and there is currently no mechanism to set a custom priority via control messages prior to this patch.
According to this patch, if SO_PRIORITY is specified as ancillary data, the packet is sent with the priority value set through sockc->priority, overriding the socket-level values set via the traditional setsockopt() method. This is analogous to the existing support for SO_MARK, as implemented in commit c6af0c227a22 ("ip: support SO_MARK cmsg").
If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that takes precedence is the last one in the cmsg list.
This patch has the side effect that raw_send_hdrinc now interprets cmsg IP_TOS.
Reviewed-by: Willem de Bruijn <[email protected]> Suggested-by: Ferenc Fejes <[email protected]> Signed-off-by: Anna Emese Nyiri <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4 |
|
| #
c972c1c4 |
| 18-Oct-2024 |
Kuniyuki Iwashima <[email protected]> |
ipv4: Switch inet_addr_hash() to less predictable hash.
Recently, commit 4a0ec2aa0704 ("ipv6: switch inet6_addr_hash() to less predictable hash") and commit 4daf4dc275f1 ("ipv6: switch inet6_acaddr_
ipv4: Switch inet_addr_hash() to less predictable hash.
Recently, commit 4a0ec2aa0704 ("ipv6: switch inet6_addr_hash() to less predictable hash") and commit 4daf4dc275f1 ("ipv6: switch inet6_acaddr_hash() to less predictable hash") hardened IPv6 address hash functions.
inet_addr_hash() is also highly predictable, and a malicious use could abuse a specific bucket.
Let's follow the change on IPv4 by using jhash_1word().
Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc3 |
|
| #
79636038 |
| 10-Oct-2024 |
Eric Dumazet <[email protected]> |
ipv4: tcp: give socket pointer to control skbs
ip_send_unicast_reply() send orphaned 'control packets'.
These are RST packets and also ACK packets sent from TIME_WAIT.
Some eBPF programs would pre
ipv4: tcp: give socket pointer to control skbs
ip_send_unicast_reply() send orphaned 'control packets'.
These are RST packets and also ACK packets sent from TIME_WAIT.
Some eBPF programs would prefer to have a meaningful skb->sk pointer as much as possible.
This means that TCP can now attach TIME_WAIT sockets to outgoing skbs.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Brian Vazquez <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc2 |
|
| #
7e863e5d |
| 01-Oct-2024 |
Guillaume Nault <[email protected]> |
ipv4: Convert ip_route_input() to dscp_t.
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input() t
ipv4: Convert ip_route_input() to dscp_t.
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input() to consider are:
* input_action_end_dx4_finish() and input_action_end_dt4() in net/ipv6/seg6_local.c. These functions set the tos parameter to 0, which is already a valid dscp_t value, so they don't need to be adjusted for the new prototype.
* icmp_route_lookup(), which already has a dscp_t variable to pass as parameter. We just need to remove the inet_dscp_to_dsfield() conversion.
* br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(), which get the DSCP directly from IPv4 headers. Define a helper to read the .tos field of struct iphdr as dscp_t, so that these function don't have to do the conversion manually.
While there, declare *iph as const in br_nf_pre_routing_finish(), declare its local variables in reverse-christmas-tree order and move the "err = ip_route_input()" assignment out of the conditional to avoid checkpatch warning.
Signed-off-by: Guillaume Nault <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6 |
|
| #
356d054a |
| 29-Aug-2024 |
Ido Schimmel <[email protected]> |
ipv4: Unmask upper DSCP bits in get_rttos()
The function is used by a few socket types to retrieve the TOS value with which to perform the FIB lookup for packets sent through the socket (flowi4_tos)
ipv4: Unmask upper DSCP bits in get_rttos()
The function is used by a few socket types to retrieve the TOS value with which to perform the FIB lookup for packets sent through the socket (flowi4_tos). If a DS field was passed using the IP_TOS control message, then it is used. Otherwise the one specified via the IP_TOS socket option.
Unmask the upper DSCP bits so that in the future the lookup could be performed according to the full DSCP value.
Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
f17bf505 |
| 29-Aug-2024 |
Eric Dumazet <[email protected]> |
icmp: icmp_msgs_per_sec and icmp_msgs_burst sysctls become per netns
Previous patch made ICMP rate limits per netns, it makes sense to allow each netns to change the associated sysctl.
Signed-off-b
icmp: icmp_msgs_per_sec and icmp_msgs_burst sysctls become per netns
Previous patch made ICMP rate limits per netns, it makes sense to allow each netns to change the associated sysctl.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
b056b4cd |
| 29-Aug-2024 |
Eric Dumazet <[email protected]> |
icmp: move icmp_global.credit and icmp_global.stamp to per netns storage
Host wide ICMP ratelimiter should be per netns, to provide better isolation.
Following patch in this series makes the sysctl
icmp: move icmp_global.credit and icmp_global.stamp to per netns storage
Host wide ICMP ratelimiter should be per netns, to provide better isolation.
Following patch in this series makes the sysctl per netns.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
8c2bd38b |
| 29-Aug-2024 |
Eric Dumazet <[email protected]> |
icmp: change the order of rate limits
ICMP messages are ratelimited :
After the blamed commits, the two rate limiters are applied in this order:
1) host wide ratelimit (icmp_global_allow())
2) Pe
icmp: change the order of rate limits
ICMP messages are ratelimited :
After the blamed commits, the two rate limiters are applied in this order:
1) host wide ratelimit (icmp_global_allow())
2) Per destination ratelimit (inetpeer based)
In order to avoid side-channels attacks, we need to apply the per destination check first.
This patch makes the following change :
1) icmp_global_allow() checks if the host wide limit is reached. But credits are not yet consumed. This is deferred to 3)
2) The per destination limit is checked/updated. This might add a new node in inetpeer tree.
3) icmp_global_consume() consumes tokens if prior operations succeeded.
This means that host wide ratelimit is still effective in keeping inetpeer tree small even under DDOS.
As a bonus, I removed icmp_global.lock as the fast path can use a lock-free operation.
Fixes: c0303efeab73 ("net: reduce cycles spend on ICMP replies that gets rate limited") Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Reported-by: Keyu Man <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: [email protected] Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
61e2bbaf |
| 31-May-2024 |
Jason Xing <[email protected]> |
net: remove NULL-pointer net parameter in ip_metrics_convert
When I was doing some experiments, I found that when using the first parameter, namely, struct net, in ip_metrics_convert() always trigge
net: remove NULL-pointer net parameter in ip_metrics_convert
When I was doing some experiments, I found that when using the first parameter, namely, struct net, in ip_metrics_convert() always triggers NULL pointer crash. Then I digged into this part, realizing that we can remove this one due to its uselessness.
Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7 |
|
| #
05d6d492 |
| 29-Apr-2024 |
Eric Dumazet <[email protected]> |
inet: introduce dst_rtable() helper
I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper")
This patch does a similar change for IPv4.
Instead of (struct rtable *)d
inet: introduce dst_rtable() helper
I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper")
This patch does a similar change for IPv4.
Instead of (struct rtable *)dst casts, we can use :
#define dst_rtable(_ptr) \ container_of_const(_ptr, struct rtable, dst)
Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Reviewed-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2 |
|
| #
e622502c |
| 25-Jan-2024 |
Nicolas Dichtel <[email protected]> |
ipmr: fix kernel panic when forwarding mcast packets
The stacktrace was: [ 86.305548] BUG: kernel NULL pointer dereference, address: 0000000000000092 [ 86.306815] #PF: supervisor read access in
ipmr: fix kernel panic when forwarding mcast packets
The stacktrace was: [ 86.305548] BUG: kernel NULL pointer dereference, address: 0000000000000092 [ 86.306815] #PF: supervisor read access in kernel mode [ 86.307717] #PF: error_code(0x0000) - not-present page [ 86.308624] PGD 0 P4D 0 [ 86.309091] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 86.309883] CPU: 2 PID: 3139 Comm: pimd Tainted: G U 6.8.0-6wind-knet #1 [ 86.311027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 [ 86.312728] RIP: 0010:ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985) [ 86.313399] Code: f9 1f 0f 87 85 03 00 00 48 8d 04 5b 48 8d 04 83 49 8d 44 c5 00 48 8b 40 70 48 39 c2 0f 84 d9 00 00 00 49 8b 46 58 48 83 e0 fe <80> b8 92 00 00 00 00 0f 84 55 ff ff ff 49 83 47 38 01 45 85 e4 0f [ 86.316565] RSP: 0018:ffffad21c0583ae0 EFLAGS: 00010246 [ 86.317497] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 86.318596] RDX: ffff9559cb46c000 RSI: 0000000000000000 RDI: 0000000000000000 [ 86.319627] RBP: ffffad21c0583b30 R08: 0000000000000000 R09: 0000000000000000 [ 86.320650] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 [ 86.321672] R13: ffff9559c093a000 R14: ffff9559cc00b800 R15: ffff9559c09c1d80 [ 86.322873] FS: 00007f85db661980(0000) GS:ffff955a79d00000(0000) knlGS:0000000000000000 [ 86.324291] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 86.325314] CR2: 0000000000000092 CR3: 000000002f13a000 CR4: 0000000000350ef0 [ 86.326589] Call Trace: [ 86.327036] <TASK> [ 86.327434] ? show_regs (/build/work/knet/arch/x86/kernel/dumpstack.c:479) [ 86.328049] ? __die (/build/work/knet/arch/x86/kernel/dumpstack.c:421 /build/work/knet/arch/x86/kernel/dumpstack.c:434) [ 86.328508] ? page_fault_oops (/build/work/knet/arch/x86/mm/fault.c:707) [ 86.329107] ? do_user_addr_fault (/build/work/knet/arch/x86/mm/fault.c:1264) [ 86.329756] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.330350] ? __irq_work_queue_local (/build/work/knet/kernel/irq_work.c:111 (discriminator 1)) [ 86.331013] ? exc_page_fault (/build/work/knet/./arch/x86/include/asm/paravirt.h:693 /build/work/knet/arch/x86/mm/fault.c:1515 /build/work/knet/arch/x86/mm/fault.c:1563) [ 86.331702] ? asm_exc_page_fault (/build/work/knet/./arch/x86/include/asm/idtentry.h:570) [ 86.332468] ? ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985) [ 86.333183] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.333920] ipmr_mfc_add (/build/work/knet/./include/linux/rcupdate.h:782 /build/work/knet/net/ipv4/ipmr.c:1009 /build/work/knet/net/ipv4/ipmr.c:1273) [ 86.334583] ? __pfx_ipmr_hash_cmp (/build/work/knet/net/ipv4/ipmr.c:363) [ 86.335357] ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470) [ 86.336135] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.336854] ? ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470) [ 86.337679] do_ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:944) [ 86.338408] ? __pfx_unix_stream_read_actor (/build/work/knet/net/unix/af_unix.c:2862) [ 86.339232] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.339809] ? aa_sk_perm (/build/work/knet/security/apparmor/include/cred.h:153 /build/work/knet/security/apparmor/net.c:181) [ 86.340342] ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:1415) [ 86.340859] raw_setsockopt (/build/work/knet/net/ipv4/raw.c:836) [ 86.341408] ? security_socket_setsockopt (/build/work/knet/security/security.c:4561 (discriminator 13)) [ 86.342116] sock_common_setsockopt (/build/work/knet/net/core/sock.c:3716) [ 86.342747] do_sock_setsockopt (/build/work/knet/net/socket.c:2313) [ 86.343363] __sys_setsockopt (/build/work/knet/./include/linux/file.h:32 /build/work/knet/net/socket.c:2336) [ 86.344020] __x64_sys_setsockopt (/build/work/knet/net/socket.c:2340) [ 86.344766] do_syscall_64 (/build/work/knet/arch/x86/entry/common.c:52 /build/work/knet/arch/x86/entry/common.c:83) [ 86.345433] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.346161] ? syscall_exit_work (/build/work/knet/./include/linux/audit.h:357 /build/work/knet/kernel/entry/common.c:160) [ 86.346938] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.347657] ? syscall_exit_to_user_mode (/build/work/knet/kernel/entry/common.c:215) [ 86.348538] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.349262] ? do_syscall_64 (/build/work/knet/./arch/x86/include/asm/cpufeature.h:171 /build/work/knet/arch/x86/entry/common.c:98) [ 86.349971] entry_SYSCALL_64_after_hwframe (/build/work/knet/arch/x86/entry/entry_64.S:129)
The original packet in ipmr_cache_report() may be queued and then forwarded with ip_mr_forward(). This last function has the assumption that the skb dst is set.
After the below commit, the skb dst is dropped by ipv4_pktinfo_prepare(), which causes the oops.
Fixes: bb7403655b3c ("ipmr: support IP_PKTINFO on cache report IGMP msg") Signed-off-by: Nicolas Dichtel <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6 |
|
| #
41db7626 |
| 14-Dec-2023 |
Eric Dumazet <[email protected]> |
inet: returns a bool from inet_sk_get_local_port_range()
Change inet_sk_get_local_port_range() to return a boolean, telling the callers if the port range was provided by IP_LOCAL_PORT_RANGE socket o
inet: returns a bool from inet_sk_get_local_port_range()
Change inet_sk_get_local_port_range() to return a boolean, telling the callers if the port range was provided by IP_LOCAL_PORT_RANGE socket option.
Adds documentation while we are at it.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc5 |
|
| #
d9f28735 |
| 06-Dec-2023 |
David Laight <[email protected]> |
Use READ/WRITE_ONCE() for IP local_port_range.
Commit 227b60f5102cd added a seqlock to ensure that the low and high port numbers were always updated together. This is overkill because the two 16bit
Use READ/WRITE_ONCE() for IP local_port_range.
Commit 227b60f5102cd added a seqlock to ensure that the low and high port numbers were always updated together. This is overkill because the two 16bit port numbers can be held in a u32 and read/written in a single instruction.
More recently 91d0b78c5177f added support for finer per-socket limits. The user-supplied value is 'high << 16 | low' but they are held separately and the socket options protected by the socket lock.
Use a u32 containing 'high << 16 | low' for both the 'net' and 'sk' fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are always updated together.
Change (the now trival) inet_get_local_port_range() to a static inline to optimise the calling code. (In particular avoiding returning integers by reference.)
Signed-off-by: David Laight <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Acked-by: Mat Martineau <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7 |
|
| #
878d951c |
| 18-Oct-2023 |
Eric Dumazet <[email protected]> |
inet: lock the socket in ip_sock_set_tos()
Christoph Paasch reported a panic in TCP stack [1]
Indeed, we should not call sk_dst_reset() without holding the socket lock, as __sk_dst_get() callers do
inet: lock the socket in ip_sock_set_tos()
Christoph Paasch reported a panic in TCP stack [1]
Indeed, we should not call sk_dst_reset() without holding the socket lock, as __sk_dst_get() callers do not all rely on bare RCU.
[1] BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321 Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69 RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293 RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01 RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002 RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0 R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580 R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000 FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0 Call Trace: <TASK> tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567 tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419 tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839 tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323 __inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676 tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021 tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336 __sock_sendmsg+0x83/0xd0 net/socket.c:730 __sys_sendto+0x20a/0x2a0 net/socket.c:2194 __do_sys_sendto net/socket.c:2206 [inline]
Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS") Reported-by: Christoph Paasch <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3 |
|
| #
e08d0b3d |
| 22-Sep-2023 |
Eric Dumazet <[email protected]> |
inet: implement lockless IP_TOS
Some reads of inet->tos are racy.
Add needed READ_ONCE() annotations and convert IP_TOS option lockless.
v2: missing changes in include/net/route.h (David Ahern)
S
inet: implement lockless IP_TOS
Some reads of inet->tos are racy.
Add needed READ_ONCE() annotations and convert IP_TOS option lockless.
v2: missing changes in include/net/route.h (David Ahern)
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
ceaa7141 |
| 22-Sep-2023 |
Eric Dumazet <[email protected]> |
inet: implement lockless IP_MTU_DISCOVER
inet->pmtudisc can be read locklessly.
Implement proper lockless reads and writes to inet->pmtudisc
ip_sock_set_mtu_discover() can now be called from arbit
inet: implement lockless IP_MTU_DISCOVER
inet->pmtudisc can be read locklessly.
Implement proper lockless reads and writes to inet->pmtudisc
ip_sock_set_mtu_discover() can now be called from arbitrary contexts.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc2, v6.6-rc1 |
|
| #
6ac66cb0 |
| 31-Aug-2023 |
Sriram Yagnaraman <[email protected]> |
ipv4: ignore dst hint for multipath routes
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multi
ipv4: ignore dst hint for multipath routes
Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group.
A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in ip_mkroute_input() and use it in ip_extract_route_hint() to check for the existence of the flag.
Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Sriram Yagnaraman <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
e3390b30 |
| 31-Aug-2023 |
Eric Dumazet <[email protected]> |
net: annotate data-races around sk->sk_tsflags
sk->sk_tsflags can be read locklessly, add corresponding annotations.
Fixes: b9f40e21ef42 ("net-timestamp: move timestamp flags out of sk_flags") Sign
net: annotate data-races around sk->sk_tsflags
sk->sk_tsflags can be read locklessly, add corresponding annotations.
Fixes: b9f40e21ef42 ("net-timestamp: move timestamp flags out of sk_flags") Signed-off-by: Eric Dumazet <[email protected]> Cc: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.5, v6.5-rc7 |
|
| #
f866fbc8 |
| 19-Aug-2023 |
Eric Dumazet <[email protected]> |
ipv4: fix data-races around inet->inet_id
UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1]
Convert inet->inet_id to an atomic_t, but implement a dedi
ipv4: fix data-races around inet->inet_id
UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1]
Convert inet->inet_id to an atomic_t, but implement a dedicated path for TCP, avoiding cost of a locked instruction (atomic_add_return())
Note that this patch will cause a trivial merge conflict because we added inet->flags in net-next tree.
v2: added missing change in drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c (David Ahern)
[1]
BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb
read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1: ip_select_ident_segs include/net/ip.h:542 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0: ip_select_ident_segs include/net/ip.h:541 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x184d -> 0x184e
Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 ==================================================================
Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|