History log of /linux-6.15/include/net/ip.h (Results 1 – 25 of 302)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# 8965c160 01-Apr-2025 Stanislav Fomichev <[email protected]>

net: use netif_disable_lro in ipv6_add_dev

ipv6_add_dev might call dev_disable_lro which unconditionally grabs
instance lock, so it will deadlock during NETDEV_REGISTER. Switch
to netif_disable_lro.

net: use netif_disable_lro in ipv6_add_dev

ipv6_add_dev might call dev_disable_lro which unconditionally grabs
instance lock, so it will deadlock during NETDEV_REGISTER. Switch
to netif_disable_lro.

Make sure all callers hold the instance lock as well.

Cc: Cosmin Ratiu <[email protected]>
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.14, v6.14-rc7, v6.14-rc6
# d4438ce6 05-Mar-2025 Eric Dumazet <[email protected]>

inet: call inet6_ehashfn() once from inet6_hash_connect()

inet6_ehashfn() being called from __inet6_check_established()
has a big impact on performance, as shown in the Tested section.

After prior

inet: call inet6_ehashfn() once from inet6_hash_connect()

inet6_ehashfn() being called from __inet6_check_established()
has a big impact on performance, as shown in the Tested section.

After prior patch, we can compute the hash for port 0
from inet6_hash_connect(), and derive each hash in
__inet_hash_connect() from this initial hash:

hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport

Apply the same principle for __inet_check_established(),
although inet_ehashfn() has a smaller cost.

Tested:

Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before this patch:

utime_start=0.286131
utime_end=4.378886
stime_start=11.952556
stime_end=1991.655533
num_transactions=1446830
latency_min=0.001061085
latency_max=12.075275028
latency_mean=0.376375302
latency_stddev=1.361969596
num_samples=306383
throughput=151866.56

perf top:

50.01% [kernel] [k] __inet6_check_established
20.65% [kernel] [k] __inet_hash_connect
15.81% [kernel] [k] inet6_ehashfn
2.92% [kernel] [k] rcu_all_qs
2.34% [kernel] [k] __cond_resched
0.50% [kernel] [k] _raw_spin_lock
0.34% [kernel] [k] sched_balance_trigger
0.24% [kernel] [k] queued_spin_lock_slowpath

After this patch:

utime_start=0.315047
utime_end=9.257617
stime_start=7.041489
stime_end=1923.688387
num_transactions=3057968
latency_min=0.003041375
latency_max=7.056589232
latency_mean=0.141075048 # Better latency metrics
latency_stddev=0.526900516
num_samples=312996
throughput=320677.21 # 111 % increase, and 229 % for the series

perf top: inet6_ehashfn is no longer seen.

39.67% [kernel] [k] __inet_hash_connect
37.06% [kernel] [k] __inet6_check_established
4.79% [kernel] [k] rcu_all_qs
3.82% [kernel] [k] __cond_resched
1.76% [kernel] [k] sched_balance_domains
0.82% [kernel] [k] _raw_spin_lock
0.81% [kernel] [k] sched_balance_rq
0.81% [kernel] [k] sched_balance_trigger
0.76% [kernel] [k] queued_spin_lock_slowpath

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Tested-by: Jason Xing <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3
# 9329b583 14-Feb-2025 Willem de Bruijn <[email protected]>

ipv4: remove get_rttos

Initialize the ip cookie tos field when initializing the cookie, in
ipcm_init_sk.

The existing code inverts the standard pattern for initializing cookie
fields. Default is to

ipv4: remove get_rttos

Initialize the ip cookie tos field when initializing the cookie, in
ipcm_init_sk.

The existing code inverts the standard pattern for initializing cookie
fields. Default is to initialize the field from the sk, then possibly
overwrite that when parsing cmsgs (the unlikely case).

This field inverts that, setting the field to an illegal value and
after cmsg parsing checking whether the value is still illegal and
thus should be overridden.

Be careful to always apply mask INET_DSCP_MASK, as before.

Signed-off-by: Willem de Bruijn <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# 94788792 14-Feb-2025 Willem de Bruijn <[email protected]>

ipv4: initialize inet socket cookies with sockcm_init

Avoid open coding the same logic.

Signed-off-by: Willem de Bruijn <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: http

ipv4: initialize inet socket cookies with sockcm_init

Avoid open coding the same logic.

Signed-off-by: Willem de Bruijn <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# 54568a84 12-Feb-2025 Eric Dumazet <[email protected]>

net: introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL()

We have many EXPORT_SYMBOL(x) in networking tree because IPv6
can be built as a module.

CONFIG_IPV6=y is becoming the norm.

Define a EXPO

net: introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL()

We have many EXPORT_SYMBOL(x) in networking tree because IPv6
can be built as a module.

CONFIG_IPV6=y is becoming the norm.

Define a EXPORT_IPV6_MOD(x) which only exports x
for modular IPv6.

Same principle applies to EXPORT_IPV6_MOD_GPL()

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Mateusz Polchlopek <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.14-rc2
# 071d8012 05-Feb-2025 Eric Dumazet <[email protected]>

ipv4: use RCU protection in ip_dst_mtu_maybe_forward()

ip_dst_mtu_maybe_forward() must use RCU protection to make
sure the net structure it reads does not disappear.

Fixes: f87c10a8aa1e8 ("ipv4: in

ipv4: use RCU protection in ip_dst_mtu_maybe_forward()

ip_dst_mtu_maybe_forward() must use RCU protection to make
sure the net structure it reads does not disappear.

Fixes: f87c10a8aa1e8 ("ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing")
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3
# a32f3e9d 13-Dec-2024 Anna Emese Nyiri <[email protected]>

sock: support SO_PRIORITY cmsg

The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to th

sock: support SO_PRIORITY cmsg

The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, when the priority value
is calculated during the handling of
ancillary data, as implemented in commit f02db315b8d8 ("ipv4: IP_TOS
and IP_TTL can be specified as ancillary data").
However, this is a computed
value, and there is currently no mechanism to set a custom priority
via control messages prior to this patch.

According to this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a22 ("ip: support SO_MARK cmsg").

If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.

This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.

Reviewed-by: Willem de Bruijn <[email protected]>
Suggested-by: Ferenc Fejes <[email protected]>
Signed-off-by: Anna Emese Nyiri <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4
# c972c1c4 18-Oct-2024 Kuniyuki Iwashima <[email protected]>

ipv4: Switch inet_addr_hash() to less predictable hash.

Recently, commit 4a0ec2aa0704 ("ipv6: switch inet6_addr_hash()
to less predictable hash") and commit 4daf4dc275f1 ("ipv6: switch
inet6_acaddr_

ipv4: Switch inet_addr_hash() to less predictable hash.

Recently, commit 4a0ec2aa0704 ("ipv6: switch inet6_addr_hash()
to less predictable hash") and commit 4daf4dc275f1 ("ipv6: switch
inet6_acaddr_hash() to less predictable hash") hardened IPv6
address hash functions.

inet_addr_hash() is also highly predictable, and a malicious use
could abuse a specific bucket.

Let's follow the change on IPv4 by using jhash_1word().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

show more ...


Revision tags: v6.12-rc3
# 79636038 10-Oct-2024 Eric Dumazet <[email protected]>

ipv4: tcp: give socket pointer to control skbs

ip_send_unicast_reply() send orphaned 'control packets'.

These are RST packets and also ACK packets sent from TIME_WAIT.

Some eBPF programs would pre

ipv4: tcp: give socket pointer to control skbs

ip_send_unicast_reply() send orphaned 'control packets'.

These are RST packets and also ACK packets sent from TIME_WAIT.

Some eBPF programs would prefer to have a meaningful skb->sk
pointer as much as possible.

This means that TCP can now attach TIME_WAIT sockets to outgoing
skbs.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Brian Vazquez <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.12-rc2
# 7e863e5d 01-Oct-2024 Guillaume Nault <[email protected]>

ipv4: Convert ip_route_input() to dscp_t.

Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.

Callers of ip_route_input() t

ipv4: Convert ip_route_input() to dscp_t.

Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.

Callers of ip_route_input() to consider are:

* input_action_end_dx4_finish() and input_action_end_dt4() in
net/ipv6/seg6_local.c. These functions set the tos parameter to 0,
which is already a valid dscp_t value, so they don't need to be
adjusted for the new prototype.

* icmp_route_lookup(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.

* br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(),
which get the DSCP directly from IPv4 headers. Define a helper to
read the .tos field of struct iphdr as dscp_t, so that these
function don't have to do the conversion manually.

While there, declare *iph as const in br_nf_pre_routing_finish(),
declare its local variables in reverse-christmas-tree order and move
the "err = ip_route_input()" assignment out of the conditional to avoid
checkpatch warning.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6
# 356d054a 29-Aug-2024 Ido Schimmel <[email protected]>

ipv4: Unmask upper DSCP bits in get_rttos()

The function is used by a few socket types to retrieve the TOS value
with which to perform the FIB lookup for packets sent through the socket
(flowi4_tos)

ipv4: Unmask upper DSCP bits in get_rttos()

The function is used by a few socket types to retrieve the TOS value
with which to perform the FIB lookup for packets sent through the socket
(flowi4_tos). If a DS field was passed using the IP_TOS control message,
then it is used. Otherwise the one specified via the IP_TOS socket
option.

Unmask the upper DSCP bits so that in the future the lookup could be
performed according to the full DSCP value.

Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

show more ...


# f17bf505 29-Aug-2024 Eric Dumazet <[email protected]>

icmp: icmp_msgs_per_sec and icmp_msgs_burst sysctls become per netns

Previous patch made ICMP rate limits per netns, it makes sense
to allow each netns to change the associated sysctl.

Signed-off-b

icmp: icmp_msgs_per_sec and icmp_msgs_burst sysctls become per netns

Previous patch made ICMP rate limits per netns, it makes sense
to allow each netns to change the associated sysctl.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# b056b4cd 29-Aug-2024 Eric Dumazet <[email protected]>

icmp: move icmp_global.credit and icmp_global.stamp to per netns storage

Host wide ICMP ratelimiter should be per netns, to provide better isolation.

Following patch in this series makes the sysctl

icmp: move icmp_global.credit and icmp_global.stamp to per netns storage

Host wide ICMP ratelimiter should be per netns, to provide better isolation.

Following patch in this series makes the sysctl per netns.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# 8c2bd38b 29-Aug-2024 Eric Dumazet <[email protected]>

icmp: change the order of rate limits

ICMP messages are ratelimited :

After the blamed commits, the two rate limiters are applied in this order:

1) host wide ratelimit (icmp_global_allow())

2) Pe

icmp: change the order of rate limits

ICMP messages are ratelimited :

After the blamed commits, the two rate limiters are applied in this order:

1) host wide ratelimit (icmp_global_allow())

2) Per destination ratelimit (inetpeer based)

In order to avoid side-channels attacks, we need to apply
the per destination check first.

This patch makes the following change :

1) icmp_global_allow() checks if the host wide limit is reached.
But credits are not yet consumed. This is deferred to 3)

2) The per destination limit is checked/updated.
This might add a new node in inetpeer tree.

3) icmp_global_consume() consumes tokens if prior operations succeeded.

This means that host wide ratelimit is still effective
in keeping inetpeer tree small even under DDOS.

As a bonus, I removed icmp_global.lock as the fast path
can use a lock-free operation.

Fixes: c0303efeab73 ("net: reduce cycles spend on ICMP replies that gets rate limited")
Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
Reported-by: Keyu Man <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2
# 61e2bbaf 31-May-2024 Jason Xing <[email protected]>

net: remove NULL-pointer net parameter in ip_metrics_convert

When I was doing some experiments, I found that when using the first
parameter, namely, struct net, in ip_metrics_convert() always trigge

net: remove NULL-pointer net parameter in ip_metrics_convert

When I was doing some experiments, I found that when using the first
parameter, namely, struct net, in ip_metrics_convert() always triggers NULL
pointer crash. Then I digged into this part, realizing that we can remove
this one due to its uselessness.

Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

show more ...


Revision tags: v6.10-rc1, v6.9, v6.9-rc7
# 05d6d492 29-Apr-2024 Eric Dumazet <[email protected]>

inet: introduce dst_rtable() helper

I added dst_rt6_info() in commit
e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper")

This patch does a similar change for IPv4.

Instead of (struct rtable *)d

inet: introduce dst_rtable() helper

I added dst_rt6_info() in commit
e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper")

This patch does a similar change for IPv4.

Instead of (struct rtable *)dst casts, we can use :

#define dst_rtable(_ptr) \
container_of_const(_ptr, struct rtable, dst)

Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Reviewed-by: Sabrina Dubroca <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2
# e622502c 25-Jan-2024 Nicolas Dichtel <[email protected]>

ipmr: fix kernel panic when forwarding mcast packets

The stacktrace was:
[ 86.305548] BUG: kernel NULL pointer dereference, address: 0000000000000092
[ 86.306815] #PF: supervisor read access in

ipmr: fix kernel panic when forwarding mcast packets

The stacktrace was:
[ 86.305548] BUG: kernel NULL pointer dereference, address: 0000000000000092
[ 86.306815] #PF: supervisor read access in kernel mode
[ 86.307717] #PF: error_code(0x0000) - not-present page
[ 86.308624] PGD 0 P4D 0
[ 86.309091] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 86.309883] CPU: 2 PID: 3139 Comm: pimd Tainted: G U 6.8.0-6wind-knet #1
[ 86.311027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[ 86.312728] RIP: 0010:ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985)
[ 86.313399] Code: f9 1f 0f 87 85 03 00 00 48 8d 04 5b 48 8d 04 83 49 8d 44 c5 00 48 8b 40 70 48 39 c2 0f 84 d9 00 00 00 49 8b 46 58 48 83 e0 fe <80> b8 92 00 00 00 00 0f 84 55 ff ff ff 49 83 47 38 01 45 85 e4 0f
[ 86.316565] RSP: 0018:ffffad21c0583ae0 EFLAGS: 00010246
[ 86.317497] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 86.318596] RDX: ffff9559cb46c000 RSI: 0000000000000000 RDI: 0000000000000000
[ 86.319627] RBP: ffffad21c0583b30 R08: 0000000000000000 R09: 0000000000000000
[ 86.320650] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 86.321672] R13: ffff9559c093a000 R14: ffff9559cc00b800 R15: ffff9559c09c1d80
[ 86.322873] FS: 00007f85db661980(0000) GS:ffff955a79d00000(0000) knlGS:0000000000000000
[ 86.324291] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 86.325314] CR2: 0000000000000092 CR3: 000000002f13a000 CR4: 0000000000350ef0
[ 86.326589] Call Trace:
[ 86.327036] <TASK>
[ 86.327434] ? show_regs (/build/work/knet/arch/x86/kernel/dumpstack.c:479)
[ 86.328049] ? __die (/build/work/knet/arch/x86/kernel/dumpstack.c:421 /build/work/knet/arch/x86/kernel/dumpstack.c:434)
[ 86.328508] ? page_fault_oops (/build/work/knet/arch/x86/mm/fault.c:707)
[ 86.329107] ? do_user_addr_fault (/build/work/knet/arch/x86/mm/fault.c:1264)
[ 86.329756] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
[ 86.330350] ? __irq_work_queue_local (/build/work/knet/kernel/irq_work.c:111 (discriminator 1))
[ 86.331013] ? exc_page_fault (/build/work/knet/./arch/x86/include/asm/paravirt.h:693 /build/work/knet/arch/x86/mm/fault.c:1515 /build/work/knet/arch/x86/mm/fault.c:1563)
[ 86.331702] ? asm_exc_page_fault (/build/work/knet/./arch/x86/include/asm/idtentry.h:570)
[ 86.332468] ? ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985)
[ 86.333183] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
[ 86.333920] ipmr_mfc_add (/build/work/knet/./include/linux/rcupdate.h:782 /build/work/knet/net/ipv4/ipmr.c:1009 /build/work/knet/net/ipv4/ipmr.c:1273)
[ 86.334583] ? __pfx_ipmr_hash_cmp (/build/work/knet/net/ipv4/ipmr.c:363)
[ 86.335357] ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470)
[ 86.336135] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
[ 86.336854] ? ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470)
[ 86.337679] do_ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:944)
[ 86.338408] ? __pfx_unix_stream_read_actor (/build/work/knet/net/unix/af_unix.c:2862)
[ 86.339232] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
[ 86.339809] ? aa_sk_perm (/build/work/knet/security/apparmor/include/cred.h:153 /build/work/knet/security/apparmor/net.c:181)
[ 86.340342] ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:1415)
[ 86.340859] raw_setsockopt (/build/work/knet/net/ipv4/raw.c:836)
[ 86.341408] ? security_socket_setsockopt (/build/work/knet/security/security.c:4561 (discriminator 13))
[ 86.342116] sock_common_setsockopt (/build/work/knet/net/core/sock.c:3716)
[ 86.342747] do_sock_setsockopt (/build/work/knet/net/socket.c:2313)
[ 86.343363] __sys_setsockopt (/build/work/knet/./include/linux/file.h:32 /build/work/knet/net/socket.c:2336)
[ 86.344020] __x64_sys_setsockopt (/build/work/knet/net/socket.c:2340)
[ 86.344766] do_syscall_64 (/build/work/knet/arch/x86/entry/common.c:52 /build/work/knet/arch/x86/entry/common.c:83)
[ 86.345433] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
[ 86.346161] ? syscall_exit_work (/build/work/knet/./include/linux/audit.h:357 /build/work/knet/kernel/entry/common.c:160)
[ 86.346938] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
[ 86.347657] ? syscall_exit_to_user_mode (/build/work/knet/kernel/entry/common.c:215)
[ 86.348538] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
[ 86.349262] ? do_syscall_64 (/build/work/knet/./arch/x86/include/asm/cpufeature.h:171 /build/work/knet/arch/x86/entry/common.c:98)
[ 86.349971] entry_SYSCALL_64_after_hwframe (/build/work/knet/arch/x86/entry/entry_64.S:129)

The original packet in ipmr_cache_report() may be queued and then forwarded
with ip_mr_forward(). This last function has the assumption that the skb
dst is set.

After the below commit, the skb dst is dropped by ipv4_pktinfo_prepare(),
which causes the oops.

Fixes: bb7403655b3c ("ipmr: support IP_PKTINFO on cache report IGMP msg")
Signed-off-by: Nicolas Dichtel <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6
# 41db7626 14-Dec-2023 Eric Dumazet <[email protected]>

inet: returns a bool from inet_sk_get_local_port_range()

Change inet_sk_get_local_port_range() to return a boolean,
telling the callers if the port range was provided by
IP_LOCAL_PORT_RANGE socket o

inet: returns a bool from inet_sk_get_local_port_range()

Change inet_sk_get_local_port_range() to return a boolean,
telling the callers if the port range was provided by
IP_LOCAL_PORT_RANGE socket option.

Adds documentation while we are at it.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.7-rc5
# d9f28735 06-Dec-2023 David Laight <[email protected]>

Use READ/WRITE_ONCE() for IP local_port_range.

Commit 227b60f5102cd added a seqlock to ensure that the low and high
port numbers were always updated together.
This is overkill because the two 16bit

Use READ/WRITE_ONCE() for IP local_port_range.

Commit 227b60f5102cd added a seqlock to ensure that the low and high
port numbers were always updated together.
This is overkill because the two 16bit port numbers can be held in
a u32 and read/written in a single instruction.

More recently 91d0b78c5177f added support for finer per-socket limits.
The user-supplied value is 'high << 16 | low' but they are held
separately and the socket options protected by the socket lock.

Use a u32 containing 'high << 16 | low' for both the 'net' and 'sk'
fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are
always updated together.

Change (the now trival) inet_get_local_port_range() to a static inline
to optimise the calling code.
(In particular avoiding returning integers by reference.)

Signed-off-by: David Laight <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Acked-by: Mat Martineau <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7
# 878d951c 18-Oct-2023 Eric Dumazet <[email protected]>

inet: lock the socket in ip_sock_set_tos()

Christoph Paasch reported a panic in TCP stack [1]

Indeed, we should not call sk_dst_reset() without holding
the socket lock, as __sk_dst_get() callers do

inet: lock the socket in ip_sock_set_tos()

Christoph Paasch reported a panic in TCP stack [1]

Indeed, we should not call sk_dst_reset() without holding
the socket lock, as __sk_dst_get() callers do not all rely
on bare RCU.

[1]
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321
Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69
RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293
RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01
RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002
RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0
R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580
R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000
FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0
Call Trace:
<TASK>
tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567
tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419
tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839
tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323
__inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676
tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021
tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336
__sock_sendmsg+0x83/0xd0 net/socket.c:730
__sys_sendto+0x20a/0x2a0 net/socket.c:2194
__do_sys_sendto net/socket.c:2206 [inline]

Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS")
Reported-by: Christoph Paasch <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

show more ...


Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3
# e08d0b3d 22-Sep-2023 Eric Dumazet <[email protected]>

inet: implement lockless IP_TOS

Some reads of inet->tos are racy.

Add needed READ_ONCE() annotations and convert IP_TOS option lockless.

v2: missing changes in include/net/route.h (David Ahern)

S

inet: implement lockless IP_TOS

Some reads of inet->tos are racy.

Add needed READ_ONCE() annotations and convert IP_TOS option lockless.

v2: missing changes in include/net/route.h (David Ahern)

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

show more ...


# ceaa7141 22-Sep-2023 Eric Dumazet <[email protected]>

inet: implement lockless IP_MTU_DISCOVER

inet->pmtudisc can be read locklessly.

Implement proper lockless reads and writes to inet->pmtudisc

ip_sock_set_mtu_discover() can now be called from arbit

inet: implement lockless IP_MTU_DISCOVER

inet->pmtudisc can be read locklessly.

Implement proper lockless reads and writes to inet->pmtudisc

ip_sock_set_mtu_discover() can now be called from arbitrary
contexts.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

show more ...


Revision tags: v6.6-rc2, v6.6-rc1
# 6ac66cb0 31-Aug-2023 Sriram Yagnaraman <[email protected]>

ipv4: ignore dst hint for multipath routes

Route hints when the nexthop is part of a multipath group causes packets
in the same receive batch to be sent to the same nexthop irrespective of
the multi

ipv4: ignore dst hint for multipath routes

Route hints when the nexthop is part of a multipath group causes packets
in the same receive batch to be sent to the same nexthop irrespective of
the multipath hash of the packet. So, do not extract route hint for
packets whose destination is part of a multipath group.

A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the
flag when route is looked up in ip_mkroute_input() and use it in
ip_extract_route_hint() to check for the existence of the flag.

Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive")
Signed-off-by: Sriram Yagnaraman <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

show more ...


# e3390b30 31-Aug-2023 Eric Dumazet <[email protected]>

net: annotate data-races around sk->sk_tsflags

sk->sk_tsflags can be read locklessly, add corresponding annotations.

Fixes: b9f40e21ef42 ("net-timestamp: move timestamp flags out of sk_flags")
Sign

net: annotate data-races around sk->sk_tsflags

sk->sk_tsflags can be read locklessly, add corresponding annotations.

Fixes: b9f40e21ef42 ("net-timestamp: move timestamp flags out of sk_flags")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

show more ...


Revision tags: v6.5, v6.5-rc7
# f866fbc8 19-Aug-2023 Eric Dumazet <[email protected]>

ipv4: fix data-races around inet->inet_id

UDP sendmsg() is lockless, so ip_select_ident_segs()
can very well be run from multiple cpus [1]

Convert inet->inet_id to an atomic_t, but implement
a dedi

ipv4: fix data-races around inet->inet_id

UDP sendmsg() is lockless, so ip_select_ident_segs()
can very well be run from multiple cpus [1]

Convert inet->inet_id to an atomic_t, but implement
a dedicated path for TCP, avoiding cost of a locked
instruction (atomic_add_return())

Note that this patch will cause a trivial merge conflict
because we added inet->flags in net-next tree.

v2: added missing change in
drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
(David Ahern)

[1]

BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb

read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1:
ip_select_ident_segs include/net/ip.h:542 [inline]
ip_select_ident include/net/ip.h:556 [inline]
__ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446
ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560
udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260
inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2634
__do_sys_sendmmsg net/socket.c:2663 [inline]
__se_sys_sendmmsg net/socket.c:2660 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0:
ip_select_ident_segs include/net/ip.h:541 [inline]
ip_select_ident include/net/ip.h:556 [inline]
__ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446
ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560
udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260
inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2634
__do_sys_sendmmsg net/socket.c:2663 [inline]
__se_sys_sendmmsg net/socket.c:2660 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x184d -> 0x184e

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
==================================================================

Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

show more ...


12345678910>>...13