|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2 |
|
| #
0bb2f7a1 |
| 07-Apr-2025 |
Kuniyuki Iwashima <[email protected]> |
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
When I ran the repro [0] and waited a few seconds, I observed two LOCKDEP splats: a warning immediately followed by a null-ptr-d
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
When I ran the repro [0] and waited a few seconds, I observed two LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]
Reproduction Steps:
1) Mount CIFS 2) Add an iptables rule to drop incoming FIN packets for CIFS 3) Unmount CIFS 4) Unload the CIFS module 5) Remove the iptables rule
At step 3), the CIFS module calls sock_release() for the underlying TCP socket, and it returns quickly. However, the socket remains in FIN_WAIT_1 because incoming FIN packets are dropped.
At this point, the module's refcnt is 0 while the socket is still alive, so the following rmmod command succeeds.
# ss -tan State Recv-Q Send-Q Local Address:Port Peer Address:Port FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445
# lsmod | grep cifs cifs 1159168 0
This highlights a discrepancy between the lifetime of the CIFS module and the underlying TCP socket. Even after CIFS calls sock_release() and it returns, the TCP socket does not die immediately in order to close the connection gracefully.
While this is generally fine, it causes an issue with LOCKDEP because CIFS assigns a different lock class to the TCP socket's sk->sk_lock using sock_lock_init_class_and_name().
Once an incoming packet is processed for the socket or a timer fires, sk->sk_lock is acquired.
Then, LOCKDEP checks the lock context in check_wait_context(), where hlock_class() is called to retrieve the lock class. However, since the module has already been unloaded, hlock_class() logs a warning and returns NULL, triggering the null-ptr-deref.
If LOCKDEP is enabled, we must ensure that a module calling sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded while such a socket is still alive to prevent this issue.
Let's hold the module reference in sock_lock_init_class_and_name() and release it when the socket is freed in sk_prot_free().
Note that sock_lock_init() clears sk->sk_owner for svc_create_socket() that calls sock_lock_init_class_and_name() for a listening socket, which clones a socket by sk_clone_lock() without GFP_ZERO.
[0]: CIFS_SERVER="10.0.0.137" CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST" DEV="enp0s3" CRED="/root/WindowsCredential.txt"
MNT=$(mktemp -d /tmp/XXXXXX) mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1
iptables -A INPUT -s ${CIFS_SERVER} -j DROP
for i in $(seq 10); do umount ${MNT} rmmod cifs sleep 1 done
rm -r ${MNT}
iptables -D INPUT -s ${CIFS_SERVER} -j DROP
[1]: DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) ... Call Trace: <IRQ> __lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178) lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ...
BUG: kernel NULL pointer dereference, address: 00000000000000c4 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36 Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178) Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44 RSP: 0018:ffa0000000468a10 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027 RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88 RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88 R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1 FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1)) ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234) ip_sublist_rcv_finish (net/ipv4/ip_input.c:576) ip_list_rcv_finish (net/ipv4/ip_input.c:628) ip_list_rcv (net/ipv4/ip_input.c:670) __netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986) netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129) napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815) __napi_poll.constprop.0 (net/core/dev.c:7191) net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382) handle_softirqs (kernel/softirq.c:561) __irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662) irq_exit_rcu (kernel/softirq.c:680) common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14)) </IRQ> <TASK> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744) Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202 RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5 RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:315) common_startup_64 (arch/x86/kernel/head_64.S:421) </TASK> Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CR2: 00000000000000c4
Fixes: ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets") Signed-off-by: Kuniyuki Iwashima <[email protected]> Cc: [email protected] Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
456cc675 |
| 28-Feb-2025 |
Geliang Tang <[email protected]> |
sock: add sock_kmemdup helper
This patch adds the sock version of kmemdup() helper, named sock_kmemdup(), to duplicate the input "src" memory block using the socket's option memory buffer.
Signed-o
sock: add sock_kmemdup helper
This patch adds the sock version of kmemdup() helper, named sock_kmemdup(), to duplicate the input "src" memory block using the socket's option memory buffer.
Signed-off-by: Geliang Tang <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Acked-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/f828077394c7d1f3560123497348b438c875b510.1740735165.git.tanggeliang@kylinos.cn Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc4 |
|
| #
5c70eb5c |
| 20-Feb-2025 |
Eric Dumazet <[email protected]> |
net: better track kernel sockets lifetime
While kernel sockets are dismantled during pernet_operations->exit(), their freeing can be delayed by any tx packets still held in qdisc or device queues, d
net: better track kernel sockets lifetime
While kernel sockets are dismantled during pernet_operations->exit(), their freeing can be delayed by any tx packets still held in qdisc or device queues, due to skb_set_owner_w() prior calls.
This then trigger the following warning from ref_tracker_dir_exit() [1]
To fix this, make sure that kernel sockets own a reference on net->passive.
Add sk_net_refcnt_upgrade() helper, used whenever a kernel socket is converted to a refcounted one.
[1]
[ 136.263918][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at [ 136.263918][ T35] sk_alloc+0x2b3/0x370 [ 136.263918][ T35] inet6_create+0x6ce/0x10f0 [ 136.263918][ T35] __sock_create+0x4c0/0xa30 [ 136.263918][ T35] inet_ctl_sock_create+0xc2/0x250 [ 136.263918][ T35] igmp6_net_init+0x39/0x390 [ 136.263918][ T35] ops_init+0x31e/0x590 [ 136.263918][ T35] setup_net+0x287/0x9e0 [ 136.263918][ T35] copy_net_ns+0x33f/0x570 [ 136.263918][ T35] create_new_namespaces+0x425/0x7b0 [ 136.263918][ T35] unshare_nsproxy_namespaces+0x124/0x180 [ 136.263918][ T35] ksys_unshare+0x57d/0xa70 [ 136.263918][ T35] __x64_sys_unshare+0x38/0x40 [ 136.263918][ T35] do_syscall_64+0xf3/0x230 [ 136.263918][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 136.263918][ T35] [ 136.343488][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at [ 136.343488][ T35] sk_alloc+0x2b3/0x370 [ 136.343488][ T35] inet6_create+0x6ce/0x10f0 [ 136.343488][ T35] __sock_create+0x4c0/0xa30 [ 136.343488][ T35] inet_ctl_sock_create+0xc2/0x250 [ 136.343488][ T35] ndisc_net_init+0xa7/0x2b0 [ 136.343488][ T35] ops_init+0x31e/0x590 [ 136.343488][ T35] setup_net+0x287/0x9e0 [ 136.343488][ T35] copy_net_ns+0x33f/0x570 [ 136.343488][ T35] create_new_namespaces+0x425/0x7b0 [ 136.343488][ T35] unshare_nsproxy_namespaces+0x124/0x180 [ 136.343488][ T35] ksys_unshare+0x57d/0xa70 [ 136.343488][ T35] __x64_sys_unshare+0x38/0x40 [ 136.343488][ T35] do_syscall_64+0xf3/0x230 [ 136.343488][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 0cafd77dcd03 ("net: add a refcount tracker for kernel sockets") Reported-by: [email protected] Closes: https://lore.kernel.org/netdev/[email protected]/T/#u Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
df600f3b |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
This patch introduces a new bpf_skops_tx_timestamping() function that prepares the "struct bpf_sock_ops" ctx and then executes the
bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
This patch introduces a new bpf_skops_tx_timestamping() function that prepares the "struct bpf_sock_ops" ctx and then executes the sockops BPF program.
The subsequent patch will utilize bpf_skops_tx_timestamping() at the existing TX timestamping kernel callbacks (__sk_tstamp_tx specifically) to call the sockops BPF program. Later, four callback points to report information to user space based on this patch will be introduced.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
24e82b7c |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Add networking timestamping support to bpf_get/setsockopt()
The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are added to bpf_get/setsockopt. The later patches will implement the BPF n
bpf: Add networking timestamping support to bpf_get/setsockopt()
The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are added to bpf_get/setsockopt. The later patches will implement the BPF networking timestamping. The BPF program will use bpf_setsockopt(SK_BPF_CB_FLAGS, SK_BPF_CB_TX_TIMESTAMPING) to enable the BPF networking timestamping on a socket.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
c8802ded |
| 18-Feb-2025 |
Paolo Abeni <[email protected]> |
net: dismiss sk_forward_alloc_get()
After the previous patch we can remove the forward_alloc_get proto callback, basically reverting commit 292e6077b040 ("net: introduce sk_forward_alloc_get()") and
net: dismiss sk_forward_alloc_get()
After the previous patch we can remove the forward_alloc_get proto callback, basically reverting commit 292e6077b040 ("net: introduce sk_forward_alloc_get()") and commit 66d58f046c9d ("net: use sk_forward_alloc_get() in sk_get_meminfo()").
Signed-off-by: Paolo Abeni <[email protected]> Acked-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-5-4a47d90d7998@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc3 |
|
| #
6ad86151 |
| 14-Feb-2025 |
Willem de Bruijn <[email protected]> |
net: initialize mark in sockcm_init
Avoid open coding initialization of sockcm fields. Avoid reading the sk_priority field twice.
This ensures all callers, existing and future, will correctly try a
net: initialize mark in sockcm_init
Avoid open coding initialization of sockcm fields. Avoid reading the sk_priority field twice.
This ensures all callers, existing and future, will correctly try a cmsg passed mark before sk_mark.
This patch extends support for cmsg mark to: packet_spkt and packet_tpacket and net/can/raw.c.
This patch extends support for cmsg priority to: packet_spkt and packet_tpacket.
Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
f0e70409 |
| 11-Feb-2025 |
Paolo Abeni <[email protected]> |
net: avoid unconditionally touching sk_tsflags on RX
After commit 5d4cc87414c5 ("net: reorganize "struct sock" fields"), the sk_tsflags field shares the same cacheline with sk_forward_alloc.
The UD
net: avoid unconditionally touching sk_tsflags on RX
After commit 5d4cc87414c5 ("net: reorganize "struct sock" fields"), the sk_tsflags field shares the same cacheline with sk_forward_alloc.
The UDP protocol does not acquire the sock lock in the RX path; forward allocations are protected via the receive queue spinlock; additionally udp_recvmsg() calls sock_recv_cmsgs() unconditionally touching sk_tsflags on each packet reception.
Due to the above, under high packet rate traffic, when the BH and the user-space process run on different CPUs, UDP packet reception experiences a cache miss while accessing sk_tsflags.
The receive path doesn't strictly need to access the problematic field; change sock_set_timestamping() to maintain the relevant information in a newly allocated sk_flags bit, so that sock_recv_cmsgs() can take decisions accessing the latter field only.
With this patch applied, on an AMD epic server with i40e NICs, I measured a 10% performance improvement for small packets UDP flood performance tests - possibly a larger delta could be observed with more recent H/W.
Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/dbd18c8a1171549f8249ac5a8b30b1b5ec88a425.1739294057.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7 |
|
| #
b2849867 |
| 07-Jan-2025 |
Oleg Nesterov <[email protected]> |
sock_poll_wait: kill the no longer necessary barrier after poll_wait()
Now that poll_wait() provides a full barrier we can remove smp_mb() from sock_poll_wait().
Also, the poll_does_not_wait() chec
sock_poll_wait: kill the no longer necessary barrier after poll_wait()
Now that poll_wait() provides a full barrier we can remove smp_mb() from sock_poll_wait().
Also, the poll_does_not_wait() check before poll_wait() just adds the unnecessary confusion, kill it. poll_wait() does the same "p && p->_qproc" check.
Signed-off-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
54f89b31 |
| 10-Dec-2024 |
Cong Wang <[email protected]> |
tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
When bpf_tcp_ingress() is called, the skmsg is being redirected to the ingress of the destination socket. Therefore, we should charge its r
tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
When bpf_tcp_ingress() is called, the skmsg is being redirected to the ingress of the destination socket. Therefore, we should charge its receive socket buffer, instead of sending socket buffer.
Because sk_rmem_schedule() tests pfmemalloc of skb, we need to introduce a wrapper and call it for skmsg.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
| #
e45469e5 |
| 13-Dec-2024 |
Anna Emese Nyiri <[email protected]> |
sock: Introduce SO_RCVPRIORITY socket option
Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the ancillary data returned by recvmsg(). This is analogous to the existing support for
sock: Introduce SO_RCVPRIORITY socket option
Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the ancillary data returned by recvmsg(). This is analogous to the existing support for SO_RCVMARK, as implemented in commit 6fd1d51cfa253 ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()").
Reviewed-by: Willem de Bruijn <[email protected]> Suggested-by: Ferenc Fejes <[email protected]> Signed-off-by: Anna Emese Nyiri <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
a32f3e9d |
| 13-Dec-2024 |
Anna Emese Nyiri <[email protected]> |
sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the socket level, applying a uniform priority to all packets sent through that socket. The exception to th
sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the socket level, applying a uniform priority to all packets sent through that socket. The exception to this is IP_TOS, when the priority value is calculated during the handling of ancillary data, as implemented in commit f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data"). However, this is a computed value, and there is currently no mechanism to set a custom priority via control messages prior to this patch.
According to this patch, if SO_PRIORITY is specified as ancillary data, the packet is sent with the priority value set through sockc->priority, overriding the socket-level values set via the traditional setsockopt() method. This is analogous to the existing support for SO_MARK, as implemented in commit c6af0c227a22 ("ip: support SO_MARK cmsg").
If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that takes precedence is the last one in the cmsg list.
This patch has the side effect that raw_send_hdrinc now interprets cmsg IP_TOS.
Reviewed-by: Willem de Bruijn <[email protected]> Suggested-by: Ferenc Fejes <[email protected]> Signed-off-by: Anna Emese Nyiri <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
9c5bd93e |
| 13-Oct-2024 |
Michal Luczaj <[email protected]> |
bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
Don't mislead the callers of bpf_{sk,msg}_redirect_{map,hash}(): make sure to immediately and visibly fail the forwarding of unsu
bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
Don't mislead the callers of bpf_{sk,msg}_redirect_{map,hash}(): make sure to immediately and visibly fail the forwarding of unsupported af_vsock packets.
Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
| #
5ced52fa |
| 10-Oct-2024 |
Eric Dumazet <[email protected]> |
net: add skb_set_owner_edemux() helper
This can be used to attach a socket to an skb, taking a reference on sk->sk_refcnt.
This helper might be a NOP if sk->sk_refcnt is zero.
Use it from tcp_make
net: add skb_set_owner_edemux() helper
This can be used to attach a socket to an skb, taking a reference on sk->sk_refcnt.
This helper might be a NOP if sk->sk_refcnt is zero.
Use it from tcp_make_synack().
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Brian Vazquez <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
bc43a3c8 |
| 10-Oct-2024 |
Eric Dumazet <[email protected]> |
net_sched: sch_fq: prepare for TIME_WAIT sockets
TCP stack is not attaching skb to TIME_WAIT sockets yet, but we would like to allow this in the future.
Add sk_listener_or_tw() helper to detect the
net_sched: sch_fq: prepare for TIME_WAIT sockets
TCP stack is not attaching skb to TIME_WAIT sockets yet, but we would like to allow this in the future.
Add sk_listener_or_tw() helper to detect the three states that FQ needs to take care.
Like NEW_SYN_RECV, TIME_WAIT are not full sockets and do not contain sk->sk_pacing_status, sk->sk_pacing_rate.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Brian Vazquez <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc2 |
|
| #
da5e06de |
| 05-Oct-2024 |
Jason Xing <[email protected]> |
net-timestamp: namespacify the sysctl_tstamp_allow_data
Let it be tuned in per netns by admins.
Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]>
net-timestamp: namespacify the sysctl_tstamp_allow_data
Let it be tuned in per netns by admins.
Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
1dae9f11 |
| 03-Oct-2024 |
Anastasia Kovaleva <[email protected]> |
net: Fix an unsafe loop on the list
The kernel may crash when deleting a genetlink family if there are still listeners for that family:
Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c0
net: Fix an unsafe loop on the list
The kernel may crash when deleting a genetlink family if there are still listeners for that family:
Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c000000000c080bc] netlink_update_socket_mc+0x3c/0xc0 LR [c000000000c0f764] __netlink_clear_multicast_users+0x74/0xc0 Call Trace: __netlink_clear_multicast_users+0x74/0xc0 genl_unregister_family+0xd4/0x2d0
Change the unsafe loop on the list to a safe one, because inside the loop there is an element removal from this list.
Fixes: b8273570f802 ("genetlink: fix netns vs. netlink table locking (2)") Cc: [email protected] Signed-off-by: Anastasia Kovaleva <[email protected]> Reviewed-by: Dmitry Bogdanov <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
822b5bc6 |
| 01-Oct-2024 |
Vadim Fedorenko <[email protected]> |
net_tstamp: add SCM_TS_OPT_ID for RAW sockets
The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW sockets. To add new option this patch converts all callers (direct and indirect)
net_tstamp: add SCM_TS_OPT_ID for RAW sockets
The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW sockets. To add new option this patch converts all callers (direct and indirect) of _sock_tx_timestamp to provide sockcm_cookie instead of tsflags. And while here fix __sock_tx_timestamp to receive tsflags as __u32 instead of __u16.
Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Jason Xing <[email protected]> Signed-off-by: Vadim Fedorenko <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
4aecca4c |
| 01-Oct-2024 |
Vadim Fedorenko <[email protected]> |
net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX timestamps and packets sent via socket. Unfortunately, there
net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX timestamps and packets sent via socket. Unfortunately, there is no way to reliably predict socket timestamp ID value in case of error returned by sendmsg. For UDP sockets it's impossible because of lockless nature of UDP transmit, several threads may send packets in parallel. In case of RAW sockets MSG_MORE option makes things complicated. More details are in the conversation [1]. This patch adds new control message type to give user-space software an opportunity to control the mapping between packets and values by providing ID with each sendmsg for UDP sockets. The documentation is also added in this patch.
[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/
Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Jason Xing <[email protected]> Signed-off-by: Vadim Fedorenko <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1, v6.11 |
|
| #
8f0b3cc9 |
| 10-Sep-2024 |
Mina Almasry <[email protected]> |
tcp: RX path for devmem TCP
In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvm
tcp: RX path for devmem TCP
In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling.
tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer.
tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information:
1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released.
The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed.
Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: Kaiyuan Zhang <[email protected]> Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
70d0bb45 |
| 22-Aug-2024 |
Simon Horman <[email protected]> |
net: Correct spelling in headers
Correct spelling in Networking headers. As reported by codespell.
Signed-off-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/20240822-net-spell-v
net: Correct spelling in headers
Correct spelling in Networking headers. As reported by codespell.
Signed-off-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5 |
|
| #
ebad6d03 |
| 20-Jun-2024 |
Sebastian Andrzej Siewior <[email protected]> |
net/ipv4: Use nested-BH locking for ipv4_tcp_sk.
ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data st
net/ipv4: Use nested-BH locking for ipv4_tcp_sk.
ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking.
Make a struct with a sock member (original ipv4_tcp_sk) and a local_lock_t and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT.
Cc: David Ahern <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc4, v6.10-rc3 |
|
| #
b4cb4a13 |
| 04-Jun-2024 |
Eric Dumazet <[email protected]> |
net: use unrcu_pointer() helper
Toke mentioned unrcu_pointer() existence, allowing to remove some of the ugly casts we have when using xchg() for rcu protected pointers.
Also make inet_rcv_compat c
net: use unrcu_pointer() helper
Toke mentioned unrcu_pointer() existence, allowing to remove some of the ugly casts we have when using xchg() for rcu protected pointers.
Also make inet_rcv_compat const.
Signed-off-by: Eric Dumazet <[email protected]> Cc: Toke Høiland-Jørgensen <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc2 |
|
| #
92f1655a |
| 28-May-2024 |
Eric Dumazet <[email protected]> |
net: fix __dst_negative_advice() race
__dst_negative_advice() does not enforce proper RCU rules when sk->dst_cache must be cleared, leading to possible UAF.
RCU rules are that we must first clear s
net: fix __dst_negative_advice() race
__dst_negative_advice() does not enforce proper RCU rules when sk->dst_cache must be cleared, leading to possible UAF.
RCU rules are that we must first clear sk->sk_dst_cache, then call dst_release(old_dst).
Note that sk_dst_reset(sk) is implementing this protocol correctly, while __dst_negative_advice() uses the wrong order.
Given that ip6_negative_advice() has special logic against RTF_CACHE, this means each of the three ->negative_advice() existing methods must perform the sk_dst_reset() themselves.
Note the check against NULL dst is centralized in __dst_negative_advice(), there is no need to duplicate it in various callbacks.
Many thanks to Clement Lecigne for tracking this issue.
This old bug became visible after the blamed commit, using UDP sockets.
Fixes: a87cb3e48ee8 ("net: Facility to report route quality of connected sockets") Reported-by: Clement Lecigne <[email protected]> Diagnosed-by: Clement Lecigne <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Tom Herbert <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc1, v6.9 |
|
| #
7951e36a |
| 09-May-2024 |
Jens Axboe <[email protected]> |
net: pass back whether socket was empty post accept
This adds an 'is_empty' argument to struct proto_accept_arg, which can be used to pass back information on whether or not the given socket has mor
net: pass back whether socket was empty post accept
This adds an 'is_empty' argument to struct proto_accept_arg, which can be used to pass back information on whether or not the given socket has more connections to accept post the one just accepted.
To utilize this information, the caller should initialize the 'is_empty' field to, eg, -1 and then check for 0/1 after the accept. If the field has been set, the caller knows whether there are more pending connections or not. If the field remains -1 after the accept call, the protocol doesn't support passing back this information.
This patch wires it up for ipv4/6 TCP.
Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|