History log of /linux-6.15/include/net/sock.h (Results 1 – 25 of 761)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2
# 0bb2f7a1 07-Apr-2025 Kuniyuki Iwashima <[email protected]>

net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.

When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-d

net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.

When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]

Reproduction Steps:

1) Mount CIFS
2) Add an iptables rule to drop incoming FIN packets for CIFS
3) Unmount CIFS
4) Unload the CIFS module
5) Remove the iptables rule

At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly. However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.

At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.

# ss -tan
State Recv-Q Send-Q Local Address:Port Peer Address:Port
FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445

# lsmod | grep cifs
cifs 1159168 0

This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket. Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.

While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().

Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.

Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class. However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.

If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.

Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().

Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.

[0]:
CIFS_SERVER="10.0.0.137"
CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST"
DEV="enp0s3"
CRED="/root/WindowsCredential.txt"

MNT=$(mktemp -d /tmp/XXXXXX)
mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1

iptables -A INPUT -s ${CIFS_SERVER} -j DROP

for i in $(seq 10);
do
umount ${MNT}
rmmod cifs
sleep 1
done

rm -r ${MNT}

iptables -D INPUT -s ${CIFS_SERVER} -j DROP

[1]:
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
...
Call Trace:
<IRQ>
__lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178)
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
...

BUG: kernel NULL pointer dereference, address: 00000000000000c4
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178)
Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44
RSP: 0018:ffa0000000468a10 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027
RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88
RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff
R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88
R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1
FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1))
ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234)
ip_sublist_rcv_finish (net/ipv4/ip_input.c:576)
ip_list_rcv_finish (net/ipv4/ip_input.c:628)
ip_list_rcv (net/ipv4/ip_input.c:670)
__netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986)
netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129)
napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496)
e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815)
__napi_poll.constprop.0 (net/core/dev.c:7191)
net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382)
handle_softirqs (kernel/softirq.c:561)
__irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662)
irq_exit_rcu (kernel/softirq.c:680)
common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14))
</IRQ>
<TASK>
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693)
RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744)
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202
RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5
RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118)
do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:315)
common_startup_64 (arch/x86/kernel/head_64.S:421)
</TASK>
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CR2: 00000000000000c4

Fixes: ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5
# 456cc675 28-Feb-2025 Geliang Tang <[email protected]>

sock: add sock_kmemdup helper

This patch adds the sock version of kmemdup() helper, named sock_kmemdup(),
to duplicate the input "src" memory block using the socket's option memory
buffer.

Signed-o

sock: add sock_kmemdup helper

This patch adds the sock version of kmemdup() helper, named sock_kmemdup(),
to duplicate the input "src" memory block using the socket's option memory
buffer.

Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/f828077394c7d1f3560123497348b438c875b510.1740735165.git.tanggeliang@kylinos.cn
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.14-rc4
# 5c70eb5c 20-Feb-2025 Eric Dumazet <[email protected]>

net: better track kernel sockets lifetime

While kernel sockets are dismantled during pernet_operations->exit(),
their freeing can be delayed by any tx packets still held in qdisc
or device queues, d

net: better track kernel sockets lifetime

While kernel sockets are dismantled during pernet_operations->exit(),
their freeing can be delayed by any tx packets still held in qdisc
or device queues, due to skb_set_owner_w() prior calls.

This then trigger the following warning from ref_tracker_dir_exit() [1]

To fix this, make sure that kernel sockets own a reference on net->passive.

Add sk_net_refcnt_upgrade() helper, used whenever a kernel socket
is converted to a refcounted one.

[1]

[ 136.263918][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at
[ 136.263918][ T35] sk_alloc+0x2b3/0x370
[ 136.263918][ T35] inet6_create+0x6ce/0x10f0
[ 136.263918][ T35] __sock_create+0x4c0/0xa30
[ 136.263918][ T35] inet_ctl_sock_create+0xc2/0x250
[ 136.263918][ T35] igmp6_net_init+0x39/0x390
[ 136.263918][ T35] ops_init+0x31e/0x590
[ 136.263918][ T35] setup_net+0x287/0x9e0
[ 136.263918][ T35] copy_net_ns+0x33f/0x570
[ 136.263918][ T35] create_new_namespaces+0x425/0x7b0
[ 136.263918][ T35] unshare_nsproxy_namespaces+0x124/0x180
[ 136.263918][ T35] ksys_unshare+0x57d/0xa70
[ 136.263918][ T35] __x64_sys_unshare+0x38/0x40
[ 136.263918][ T35] do_syscall_64+0xf3/0x230
[ 136.263918][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 136.263918][ T35]
[ 136.343488][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at
[ 136.343488][ T35] sk_alloc+0x2b3/0x370
[ 136.343488][ T35] inet6_create+0x6ce/0x10f0
[ 136.343488][ T35] __sock_create+0x4c0/0xa30
[ 136.343488][ T35] inet_ctl_sock_create+0xc2/0x250
[ 136.343488][ T35] ndisc_net_init+0xa7/0x2b0
[ 136.343488][ T35] ops_init+0x31e/0x590
[ 136.343488][ T35] setup_net+0x287/0x9e0
[ 136.343488][ T35] copy_net_ns+0x33f/0x570
[ 136.343488][ T35] create_new_namespaces+0x425/0x7b0
[ 136.343488][ T35] unshare_nsproxy_namespaces+0x124/0x180
[ 136.343488][ T35] ksys_unshare+0x57d/0xa70
[ 136.343488][ T35] __x64_sys_unshare+0x38/0x40
[ 136.343488][ T35] do_syscall_64+0xf3/0x230
[ 136.343488][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 0cafd77dcd03 ("net: add a refcount tracker for kernel sockets")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/T/#u
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# df600f3b 20-Feb-2025 Jason Xing <[email protected]>

bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping

This patch introduces a new bpf_skops_tx_timestamping() function
that prepares the "struct bpf_sock_ops" ctx and then executes the

bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping

This patch introduces a new bpf_skops_tx_timestamping() function
that prepares the "struct bpf_sock_ops" ctx and then executes the
sockops BPF program.

The subsequent patch will utilize bpf_skops_tx_timestamping() at
the existing TX timestamping kernel callbacks (__sk_tstamp_tx
specifically) to call the sockops BPF program. Later, four callback
points to report information to user space based on this patch will
be introduced.

Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/[email protected]

show more ...


# 24e82b7c 20-Feb-2025 Jason Xing <[email protected]>

bpf: Add networking timestamping support to bpf_get/setsockopt()

The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are
added to bpf_get/setsockopt. The later patches will implement the
BPF n

bpf: Add networking timestamping support to bpf_get/setsockopt()

The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are
added to bpf_get/setsockopt. The later patches will implement the
BPF networking timestamping. The BPF program will use
bpf_setsockopt(SK_BPF_CB_FLAGS, SK_BPF_CB_TX_TIMESTAMPING) to
enable the BPF networking timestamping on a socket.

Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]

show more ...


# c8802ded 18-Feb-2025 Paolo Abeni <[email protected]>

net: dismiss sk_forward_alloc_get()

After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b040 ("net: introduce
sk_forward_alloc_get()") and

net: dismiss sk_forward_alloc_get()

After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b040 ("net: introduce
sk_forward_alloc_get()") and commit 66d58f046c9d ("net: use
sk_forward_alloc_get() in sk_get_meminfo()").

Signed-off-by: Paolo Abeni <[email protected]>
Acked-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-5-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.14-rc3
# 6ad86151 14-Feb-2025 Willem de Bruijn <[email protected]>

net: initialize mark in sockcm_init

Avoid open coding initialization of sockcm fields.
Avoid reading the sk_priority field twice.

This ensures all callers, existing and future, will correctly try a

net: initialize mark in sockcm_init

Avoid open coding initialization of sockcm fields.
Avoid reading the sk_priority field twice.

This ensures all callers, existing and future, will correctly try a
cmsg passed mark before sk_mark.

This patch extends support for cmsg mark to:
packet_spkt and packet_tpacket and net/can/raw.c.

This patch extends support for cmsg priority to:
packet_spkt and packet_tpacket.

Signed-off-by: Willem de Bruijn <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# f0e70409 11-Feb-2025 Paolo Abeni <[email protected]>

net: avoid unconditionally touching sk_tsflags on RX

After commit 5d4cc87414c5 ("net: reorganize "struct sock" fields"),
the sk_tsflags field shares the same cacheline with sk_forward_alloc.

The UD

net: avoid unconditionally touching sk_tsflags on RX

After commit 5d4cc87414c5 ("net: reorganize "struct sock" fields"),
the sk_tsflags field shares the same cacheline with sk_forward_alloc.

The UDP protocol does not acquire the sock lock in the RX path;
forward allocations are protected via the receive queue spinlock;
additionally udp_recvmsg() calls sock_recv_cmsgs() unconditionally
touching sk_tsflags on each packet reception.

Due to the above, under high packet rate traffic, when the BH and the
user-space process run on different CPUs, UDP packet reception
experiences a cache miss while accessing sk_tsflags.

The receive path doesn't strictly need to access the problematic field;
change sock_set_timestamping() to maintain the relevant information
in a newly allocated sk_flags bit, so that sock_recv_cmsgs() can
take decisions accessing the latter field only.

With this patch applied, on an AMD epic server with i40e NICs, I
measured a 10% performance improvement for small packets UDP flood
performance tests - possibly a larger delta could be observed with more
recent H/W.

Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/dbd18c8a1171549f8249ac5a8b30b1b5ec88a425.1739294057.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7
# b2849867 07-Jan-2025 Oleg Nesterov <[email protected]>

sock_poll_wait: kill the no longer necessary barrier after poll_wait()

Now that poll_wait() provides a full barrier we can remove smp_mb() from
sock_poll_wait().

Also, the poll_does_not_wait() chec

sock_poll_wait: kill the no longer necessary barrier after poll_wait()

Now that poll_wait() provides a full barrier we can remove smp_mb() from
sock_poll_wait().

Also, the poll_does_not_wait() check before poll_wait() just adds the
unnecessary confusion, kill it. poll_wait() does the same "p && p->_qproc"
check.

Signed-off-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3
# 54f89b31 10-Dec-2024 Cong Wang <[email protected]>

tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()

When bpf_tcp_ingress() is called, the skmsg is being redirected to the
ingress of the destination socket. Therefore, we should charge its
r

tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()

When bpf_tcp_ingress() is called, the skmsg is being redirected to the
ingress of the destination socket. Therefore, we should charge its
receive socket buffer, instead of sending socket buffer.

Because sk_rmem_schedule() tests pfmemalloc of skb, we need to
introduce a wrapper and call it for skmsg.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

show more ...


# e45469e5 13-Dec-2024 Anna Emese Nyiri <[email protected]>

sock: Introduce SO_RCVPRIORITY socket option

Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the
ancillary data returned by recvmsg().
This is analogous to the existing support for

sock: Introduce SO_RCVPRIORITY socket option

Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the
ancillary data returned by recvmsg().
This is analogous to the existing support for SO_RCVMARK,
as implemented in commit 6fd1d51cfa253 ("net: SO_RCVMARK socket option
for SO_MARK with recvmsg()").

Reviewed-by: Willem de Bruijn <[email protected]>
Suggested-by: Ferenc Fejes <[email protected]>
Signed-off-by: Anna Emese Nyiri <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# a32f3e9d 13-Dec-2024 Anna Emese Nyiri <[email protected]>

sock: support SO_PRIORITY cmsg

The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to th

sock: support SO_PRIORITY cmsg

The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, when the priority value
is calculated during the handling of
ancillary data, as implemented in commit f02db315b8d8 ("ipv4: IP_TOS
and IP_TTL can be specified as ancillary data").
However, this is a computed
value, and there is currently no mechanism to set a custom priority
via control messages prior to this patch.

According to this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a22 ("ip: support SO_MARK cmsg").

If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.

This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.

Reviewed-by: Willem de Bruijn <[email protected]>
Suggested-by: Ferenc Fejes <[email protected]>
Signed-off-by: Anna Emese Nyiri <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3
# 9c5bd93e 13-Oct-2024 Michal Luczaj <[email protected]>

bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock

Don't mislead the callers of bpf_{sk,msg}_redirect_{map,hash}(): make sure
to immediately and visibly fail the forwarding of unsu

bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock

Don't mislead the callers of bpf_{sk,msg}_redirect_{map,hash}(): make sure
to immediately and visibly fail the forwarding of unsupported af_vsock
packets.

Fixes: 634f1a7110b4 ("vsock: support sockmap")
Signed-off-by: Michal Luczaj <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

show more ...


# 5ced52fa 10-Oct-2024 Eric Dumazet <[email protected]>

net: add skb_set_owner_edemux() helper

This can be used to attach a socket to an skb,
taking a reference on sk->sk_refcnt.

This helper might be a NOP if sk->sk_refcnt is zero.

Use it from tcp_make

net: add skb_set_owner_edemux() helper

This can be used to attach a socket to an skb,
taking a reference on sk->sk_refcnt.

This helper might be a NOP if sk->sk_refcnt is zero.

Use it from tcp_make_synack().

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Brian Vazquez <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# bc43a3c8 10-Oct-2024 Eric Dumazet <[email protected]>

net_sched: sch_fq: prepare for TIME_WAIT sockets

TCP stack is not attaching skb to TIME_WAIT sockets yet,
but we would like to allow this in the future.

Add sk_listener_or_tw() helper to detect the

net_sched: sch_fq: prepare for TIME_WAIT sockets

TCP stack is not attaching skb to TIME_WAIT sockets yet,
but we would like to allow this in the future.

Add sk_listener_or_tw() helper to detect the three states
that FQ needs to take care.

Like NEW_SYN_RECV, TIME_WAIT are not full sockets and
do not contain sk->sk_pacing_status, sk->sk_pacing_rate.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Brian Vazquez <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.12-rc2
# da5e06de 05-Oct-2024 Jason Xing <[email protected]>

net-timestamp: namespacify the sysctl_tstamp_allow_data

Let it be tuned in per netns by admins.

Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>

net-timestamp: namespacify the sysctl_tstamp_allow_data

Let it be tuned in per netns by admins.

Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# 1dae9f11 03-Oct-2024 Anastasia Kovaleva <[email protected]>

net: Fix an unsafe loop on the list

The kernel may crash when deleting a genetlink family if there are still
listeners for that family:

Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c0

net: Fix an unsafe loop on the list

The kernel may crash when deleting a genetlink family if there are still
listeners for that family:

Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c000000000c080bc] netlink_update_socket_mc+0x3c/0xc0
LR [c000000000c0f764] __netlink_clear_multicast_users+0x74/0xc0
Call Trace:
__netlink_clear_multicast_users+0x74/0xc0
genl_unregister_family+0xd4/0x2d0

Change the unsafe loop on the list to a safe one, because inside the
loop there is an element removal from this list.

Fixes: b8273570f802 ("genetlink: fix netns vs. netlink table locking (2)")
Cc: [email protected]
Signed-off-by: Anastasia Kovaleva <[email protected]>
Reviewed-by: Dmitry Bogdanov <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# 822b5bc6 01-Oct-2024 Vadim Fedorenko <[email protected]>

net_tstamp: add SCM_TS_OPT_ID for RAW sockets

The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW
sockets. To add new option this patch converts all callers (direct and
indirect)

net_tstamp: add SCM_TS_OPT_ID for RAW sockets

The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW
sockets. To add new option this patch converts all callers (direct and
indirect) of _sock_tx_timestamp to provide sockcm_cookie instead of
tsflags. And while here fix __sock_tx_timestamp to receive tsflags as
__u32 instead of __u16.

Reviewed-by: Willem de Bruijn <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Signed-off-by: Vadim Fedorenko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


# 4aecca4c 01-Oct-2024 Vadim Fedorenko <[email protected]>

net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message

SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there

net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message

SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.

[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/

Reviewed-by: Willem de Bruijn <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Signed-off-by: Vadim Fedorenko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.12-rc1, v6.11
# 8f0b3cc9 10-Sep-2024 Mina Almasry <[email protected]>

tcp: RX path for devmem TCP

In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
flag - pass it to tcp_recvm

tcp: RX path for devmem TCP

In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
flag - pass it to tcp_recvmsg_devmem() for custom handling.

tcp_recvmsg_devmem() copies any data in the skb header to the linear
buffer, and returns a cmsg to the user indicating the number of bytes
returned in the linear buffer.

tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
and returns to the user a cmsg_devmem indicating the location of the
data in the dmabuf device memory. cmsg_devmem contains this information:

1. the offset into the dmabuf where the payload starts. 'frag_offset'.
2. the size of the frag. 'frag_size'.
3. an opaque token 'frag_token' to return to the kernel when the buffer
is to be released.

The pages awaiting freeing are stored in the newly added
sk->sk_user_frags, and each page passed to userspace is get_page()'d.
This reference is dropped once the userspace indicates that it is
done reading this page. All pages are released when the socket is
destroyed.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Kaiyuan Zhang <[email protected]>
Signed-off-by: Mina Almasry <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5
# 70d0bb45 22-Aug-2024 Simon Horman <[email protected]>

net: Correct spelling in headers

Correct spelling in Networking headers.
As reported by codespell.

Signed-off-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/20240822-net-spell-v

net: Correct spelling in headers

Correct spelling in Networking headers.
As reported by codespell.

Signed-off-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5
# ebad6d03 20-Jun-2024 Sebastian Andrzej Siewior <[email protected]>

net/ipv4: Use nested-BH locking for ipv4_tcp_sk.

ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data st

net/ipv4: Use nested-BH locking for ipv4_tcp_sk.

ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a sock member (original ipv4_tcp_sk) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: David Ahern <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.10-rc4, v6.10-rc3
# b4cb4a13 04-Jun-2024 Eric Dumazet <[email protected]>

net: use unrcu_pointer() helper

Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.

Also make inet_rcv_compat c

net: use unrcu_pointer() helper

Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.

Also make inet_rcv_compat const.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Toke Høiland-Jørgensen <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

show more ...


Revision tags: v6.10-rc2
# 92f1655a 28-May-2024 Eric Dumazet <[email protected]>

net: fix __dst_negative_advice() race

__dst_negative_advice() does not enforce proper RCU rules when
sk->dst_cache must be cleared, leading to possible UAF.

RCU rules are that we must first clear s

net: fix __dst_negative_advice() race

__dst_negative_advice() does not enforce proper RCU rules when
sk->dst_cache must be cleared, leading to possible UAF.

RCU rules are that we must first clear sk->sk_dst_cache,
then call dst_release(old_dst).

Note that sk_dst_reset(sk) is implementing this protocol correctly,
while __dst_negative_advice() uses the wrong order.

Given that ip6_negative_advice() has special logic
against RTF_CACHE, this means each of the three ->negative_advice()
existing methods must perform the sk_dst_reset() themselves.

Note the check against NULL dst is centralized in
__dst_negative_advice(), there is no need to duplicate
it in various callbacks.

Many thanks to Clement Lecigne for tracking this issue.

This old bug became visible after the blamed commit, using UDP sockets.

Fixes: a87cb3e48ee8 ("net: Facility to report route quality of connected sockets")
Reported-by: Clement Lecigne <[email protected]>
Diagnosed-by: Clement Lecigne <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Tom Herbert <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

show more ...


Revision tags: v6.10-rc1, v6.9
# 7951e36a 09-May-2024 Jens Axboe <[email protected]>

net: pass back whether socket was empty post accept

This adds an 'is_empty' argument to struct proto_accept_arg, which can
be used to pass back information on whether or not the given socket has
mor

net: pass back whether socket was empty post accept

This adds an 'is_empty' argument to struct proto_accept_arg, which can
be used to pass back information on whether or not the given socket has
more connections to accept post the one just accepted.

To utilize this information, the caller should initialize the 'is_empty'
field to, eg, -1 and then check for 0/1 after the accept. If the field
has been set, the caller knows whether there are more pending connections
or not. If the field remains -1 after the accept call, the protocol
doesn't support passing back this information.

This patch wires it up for ipv4/6 TCP.

Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

show more ...


12345678910>>...31