|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
f1a69a94 |
| 04-Apr-2025 |
Ricardo Cañuelo Navarro <[email protected]> |
sctp: detect and prevent references to a freed transport in sendmsg
sctp_sendmsg() re-uses associations and transports when possible by doing a lookup based on the socket endpoint and the message de
sctp: detect and prevent references to a freed transport in sendmsg
sctp_sendmsg() re-uses associations and transports when possible by doing a lookup based on the socket endpoint and the message destination address, and then sctp_sendmsg_to_asoc() sets the selected transport in all the message chunks to be sent.
There's a possible race condition if another thread triggers the removal of that selected transport, for instance, by explicitly unbinding an address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have been set up and before the message is sent. This can happen if the send buffer is full, during the period when the sender thread temporarily releases the socket lock in sctp_wait_for_sndbuf().
This causes the access to the transport data in sctp_outq_select_transport(), when the association outqueue is flushed, to result in a use-after-free read.
This change avoids this scenario by having sctp_transport_free() signal the freeing of the transport, tagging it as "dead". In order to do this, the patch restores the "dead" bit in struct sctp_transport, which was removed in commit 47faa1e4c50e ("sctp: remove the dead field of sctp_transport").
Then, in the scenario where the sender thread has released the socket lock in sctp_wait_for_sndbuf(), the bit is checked again after re-acquiring the socket lock to detect the deletion. This is done while holding a reference to the transport to prevent it from being freed in the process.
If the transport was deleted while the socket lock was relinquished, sctp_sendmsg_to_asoc() will return -EAGAIN to let userspace retry the send.
The bug was found by a private syzbot instance (see the error report [1] and the C reproducer that triggers it [2]).
Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport.txt [1] Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport__repro.c [2] Cc: [email protected] Fixes: df132eff4638 ("sctp: clear the transport of some out_chunk_list chunks in sctp_assoc_rm_peer") Suggested-by: Xin Long <[email protected]> Signed-off-by: Ricardo Cañuelo Navarro <[email protected]> Acked-by: Xin Long <[email protected]> Link: https://patch.msgid.link/20250404-kasan_slab-use-after-free_read_in_sctp_outq_select_transport__20250404-v1-1-5ce4a0b78ef2@igalia.com Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
4d5c70e6 |
| 07-Oct-2024 |
Xin Long <[email protected]> |
sctp: ensure sk_state is set to CLOSED if hashing fails in sctp_listen_start
If hashing fails in sctp_listen_start(), the socket remains in the LISTENING state, even though it was not added to the h
sctp: ensure sk_state is set to CLOSED if hashing fails in sctp_listen_start
If hashing fails in sctp_listen_start(), the socket remains in the LISTENING state, even though it was not added to the hash table. This can lead to a scenario where a socket appears to be listening without actually being accessible.
This patch ensures that if the hashing operation fails, the sk_state is set back to CLOSED before returning an error.
Note that there is no need to undo the autobind operation if hashing fails, as the bind port can still be used for next listen() call on the same socket.
Fixes: 76c6d988aeb3 ("sctp: add sock_reuseport for the sock in __sctp_hash_endpoint") Reported-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: Xin Long <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc2 |
|
| #
8beee4d8 |
| 30-Sep-2024 |
Xin Long <[email protected]> |
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
In sctp_listen_start() invoked by sctp_inet_listen(), it should set the sk_state back to CLOSED if sctp_autobind() fails due
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
In sctp_listen_start() invoked by sctp_inet_listen(), it should set the sk_state back to CLOSED if sctp_autobind() fails due to whatever reason.
Otherwise, next time when calling sctp_inet_listen(), if sctp_sk(sk)->reuse is already set via setsockopt(SCTP_REUSE_PORT), sctp_sk(sk)->bind_hash will be dereferenced as sk_state is LISTENING, which causes a crash as bind_hash is NULL.
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:sctp_inet_listen+0x7f0/0xa20 net/sctp/socket.c:8617 Call Trace: <TASK> __sys_listen_socket net/socket.c:1883 [inline] __sys_listen+0x1b7/0x230 net/socket.c:1894 __do_sys_listen net/socket.c:1902 [inline]
Fixes: 5e8f3f703ae4 ("sctp: simplify sctp listening code") Reported-by: [email protected] Signed-off-by: Xin Long <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Link: https://patch.msgid.link/a93e655b3c153dc8945d7a812e6d8ab0d52b7aa0.1727729391.git.lucien.xin@gmail.com Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7 |
|
| #
cda91d5b |
| 01-Jul-2024 |
Xin Long <[email protected]> |
sctp: cancel a blocking accept when shutdown a listen socket
As David Laight noticed,
"In a multithreaded program it is reasonable to have a thread blocked in accept(). With TCP a subsequent shutd
sctp: cancel a blocking accept when shutdown a listen socket
As David Laight noticed,
"In a multithreaded program it is reasonable to have a thread blocked in accept(). With TCP a subsequent shutdown(listen_fd, SHUT_RDWR) causes the accept to fail. But nothing happens for SCTP."
sctp_disconnect() is eventually called when shutdown a listen socket, but nothing is done in this function. This patch sets RCV_SHUTDOWN flag in sk->sk_shutdown there, and adds the check (sk->sk_shutdown & RCV_SHUTDOWN) to break and return in sctp_accept().
Note that shutdown() is only supported on TCP-style SCTP socket.
Reported-by: David Laight <[email protected]> Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9 |
|
| #
92ef0fd5 |
| 09-May-2024 |
Jens Axboe <[email protected]> |
net: change proto and proto_ops accept type
Rather than pass in flags, error pointer, and whether this is a kernel invocation or not, add a struct proto_accept_arg struct as the argument. This then
net: change proto and proto_ops accept type
Rather than pass in flags, error pointer, and whether this is a kernel invocation or not, add a struct proto_accept_arg struct as the argument. This then holds all of these arguments, and prepares accept for being able to pass back more information.
No functional changes in this patch.
Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc7, v6.9-rc6 |
|
| #
e5c5f359 |
| 27-Apr-2024 |
Erick Archer <[email protected]> |
sctp: prefer struct_size over open coded arithmetic
This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2].
As the "ids" variable
sctp: prefer struct_size over open coded arithmetic
This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2].
As the "ids" variable is a pointer to "struct sctp_assoc_ids" and this structure ends in a flexible array:
struct sctp_assoc_ids { [...] sctp_assoc_t gaids_assoc_id[]; };
the preferred way in the kernel is to use the struct_size() helper to do the arithmetic instead of the calculation "size + size * count" in the kmalloc() function.
Also, refactor the code adding the "ids_size" variable to avoid sizing twice.
This way, the code is more readable and safer.
This code was detected with the help of Coccinelle, and audited and modified manually.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1] Link: https://github.com/KSPP/linux/issues/160 [2] Signed-off-by: Erick Archer <[email protected]> Acked-by: Xin Long <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/PAXPR02MB724871DB78375AB06B5171C88B152@PAXPR02MB7248.eurprd02.prod.outlook.com Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2 |
|
| #
1abe267f |
| 28-Mar-2024 |
Eric Dumazet <[email protected]> |
net: add sk_wake_async_rcu() helper
While looking at UDP receive performance, I saw sk_wake_async() was no longer inlined.
This matters at least on AMD Zen1-4 platforms (see SRSO)
This might be be
net: add sk_wake_async_rcu() helper
While looking at UDP receive performance, I saw sk_wake_async() was no longer inlined.
This matters at least on AMD Zen1-4 platforms (see SRSO)
This might be because rcu_read_lock() and rcu_read_unlock() are no longer nops in recent kernels ?
Add sk_wake_async_rcu() variant, which must be called from contexts already holding rcu lock.
As SOCK_FASYNC is deprecated in modern days, use unlikely() to give a hint to the compiler.
sk_wake_async_rcu() is properly inlined from __udp_enqueue_schedule_skb() and sock_def_readable().
Signed-off-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc1, v6.8 |
|
| #
490a79fa |
| 06-Mar-2024 |
Eric Dumazet <[email protected]> |
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <[email protected]
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7 |
|
| #
a562c0a2 |
| 19-Dec-2023 |
Eric Dumazet <[email protected]> |
sctp: fix busy polling
Busy polling while holding the socket lock makes litle sense, because incoming packets wont reach our receive queue.
Fixes: 8465a5fcd1ce ("sctp: add support for busy polling
sctp: fix busy polling
Busy polling while holding the socket lock makes litle sense, because incoming packets wont reach our receive queue.
Fixes: 8465a5fcd1ce ("sctp: add support for busy polling to sctp protocol") Reported-by: Jacob Moroni <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Marcelo Ricardo Leitner <[email protected]> Cc: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc6 |
|
| #
4746b36b |
| 12-Dec-2023 |
Eric Dumazet <[email protected]> |
sctp: support MSG_ERRQUEUE flag in recvmsg()
For some reason sctp_poll() generates EPOLLERR if sk->sk_error_queue is not empty but recvmsg() can not drain the error queue yet.
This is needed to bet
sctp: support MSG_ERRQUEUE flag in recvmsg()
For some reason sctp_poll() generates EPOLLERR if sk->sk_error_queue is not empty but recvmsg() can not drain the error queue yet.
This is needed to better support timestamping.
I had to export inet_recv_error(), since sctp can be compiled as a module.
Signed-off-by: Eric Dumazet <[email protected]> Cc: Marcelo Ricardo Leitner <[email protected]> Cc: Willem de Bruijn <[email protected]> Acked-by: Xin Long <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4 |
|
| #
1f4e803c |
| 01-Oct-2023 |
Xin Long <[email protected]> |
sctp: update hb timer immediately after users change hb_interval
Currently, when hb_interval is changed by users, it won't take effect until the next expiry of hb timer. As the default value is 30s,
sctp: update hb timer immediately after users change hb_interval
Currently, when hb_interval is changed by users, it won't take effect until the next expiry of hb timer. As the default value is 30s, users have to wait up to 30s to wait its hb_interval update to work.
This becomes pretty bad in containers where a much smaller value is usually set on hb_interval. This patch improves it by resetting the hb timer immediately once the value of hb_interval is updated by users.
Note that we don't address the already existing 'problem' when sending a heartbeat 'on demand' if one hb has just been sent(from the timer) mentioned in:
https://www.mail-archive.com/[email protected]/msg590224.html
Signed-off-by: Xin Long <[email protected]> Reviewed-by: Simon Horman <[email protected]> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc3, v6.6-rc2, v6.6-rc1 |
|
| #
dc9511dd |
| 30-Aug-2023 |
Eric Dumazet <[email protected]> |
sctp: annotate data-races around sk->sk_wmem_queued
sk->sk_wmem_queued can be read locklessly from sctp_poll()
Use sk_wmem_queued_add() when the field is changed, and add READ_ONCE() annotations in
sctp: annotate data-races around sk->sk_wmem_queued
sk->sk_wmem_queued can be read locklessly from sctp_poll()
Use sk_wmem_queued_add() when the field is changed, and add READ_ONCE() annotations in sctp_writeable() and sctp_assocs_seq_show()
syzbot reported:
BUG: KCSAN: data-race in sctp_poll / sctp_wfree
read-write to 0xffff888149d77810 of 4 bytes by interrupt on cpu 0: sctp_wfree+0x170/0x4a0 net/sctp/socket.c:9147 skb_release_head_state+0xb7/0x1a0 net/core/skbuff.c:988 skb_release_all net/core/skbuff.c:1000 [inline] __kfree_skb+0x16/0x140 net/core/skbuff.c:1016 consume_skb+0x57/0x180 net/core/skbuff.c:1232 sctp_chunk_destroy net/sctp/sm_make_chunk.c:1503 [inline] sctp_chunk_put+0xcd/0x130 net/sctp/sm_make_chunk.c:1530 sctp_datamsg_put+0x29a/0x300 net/sctp/chunk.c:128 sctp_chunk_free+0x34/0x50 net/sctp/sm_make_chunk.c:1515 sctp_outq_sack+0xafa/0xd70 net/sctp/outqueue.c:1381 sctp_cmd_process_sack net/sctp/sm_sideeffect.c:834 [inline] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1366 [inline] sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x12c7/0x31b0 net/sctp/sm_sideeffect.c:1169 sctp_assoc_bh_rcv+0x2b2/0x430 net/sctp/associola.c:1051 sctp_inq_push+0x108/0x120 net/sctp/inqueue.c:80 sctp_rcv+0x116e/0x1340 net/sctp/input.c:243 sctp6_rcv+0x25/0x40 net/sctp/ipv6.c:1120 ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437 ip6_input_finish net/ipv6/ip6_input.c:482 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:468 [inline] ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79 NF_HOOK include/linux/netfilter.h:303 [inline] ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core net/core/dev.c:5452 [inline] __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566 process_backlog+0x21f/0x380 net/core/dev.c:5894 __napi_poll+0x60/0x3b0 net/core/dev.c:6460 napi_poll net/core/dev.c:6527 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6660 __do_softirq+0xc1/0x265 kernel/softirq.c:553 run_ksoftirqd+0x17/0x20 kernel/softirq.c:921 smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164 kthread+0x1d7/0x210 kernel/kthread.c:389 ret_from_fork+0x2e/0x40 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
read to 0xffff888149d77810 of 4 bytes by task 17828 on cpu 1: sctp_writeable net/sctp/socket.c:9304 [inline] sctp_poll+0x265/0x410 net/sctp/socket.c:8671 sock_poll+0x253/0x270 net/socket.c:1374 vfs_poll include/linux/poll.h:88 [inline] do_pollfd fs/select.c:873 [inline] do_poll fs/select.c:921 [inline] do_sys_poll+0x636/0xc00 fs/select.c:1015 __do_sys_ppoll fs/select.c:1121 [inline] __se_sys_ppoll+0x1af/0x1f0 fs/select.c:1101 __x64_sys_ppoll+0x67/0x80 fs/select.c:1101 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x00019e80 -> 0x0000cc80
Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17828 Comm: syz-executor.1 Not tainted 6.5.0-rc7-syzkaller-00185-g28f20a19294d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Marcelo Ricardo Leitner <[email protected]> Acked-by: Xin Long <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.5, v6.5-rc7 |
|
| #
f866fbc8 |
| 19-Aug-2023 |
Eric Dumazet <[email protected]> |
ipv4: fix data-races around inet->inet_id
UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1]
Convert inet->inet_id to an atomic_t, but implement a dedi
ipv4: fix data-races around inet->inet_id
UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1]
Convert inet->inet_id to an atomic_t, but implement a dedicated path for TCP, avoiding cost of a locked instruction (atomic_add_return())
Note that this patch will cause a trivial merge conflict because we added inet->flags in net-next tree.
v2: added missing change in drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c (David Ahern)
[1]
BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb
read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1: ip_select_ident_segs include/net/ip.h:542 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0: ip_select_ident_segs include/net/ip.h:541 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x184d -> 0x184e
Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 ==================================================================
Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
76f33296 |
| 18-Aug-2023 |
Eric Dumazet <[email protected]> |
sock: annotate data-races around prot->memory_pressure
*prot->memory_pressure is read/writen locklessly, we need to add proper annotations.
A recent commit added a new race, it is time to audit all
sock: annotate data-races around prot->memory_pressure
*prot->memory_pressure is read/writen locklessly, we need to add proper annotations.
A recent commit added a new race, it is time to audit all accesses.
Fixes: 2d0c88e84e48 ("sock: Fix misuse of sk_under_memory_pressure()") Fixes: 4d93df0abd50 ("[SCTP]: Rewrite of sctp buffer management code") Signed-off-by: Eric Dumazet <[email protected]> Cc: Abel Wu <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
b09bde5c |
| 16-Aug-2023 |
Eric Dumazet <[email protected]> |
inet: move inet->mc_loop to inet->inet_frags
IP_MULTICAST_LOOP socket option can now be set/read without locking the socket.
v3: fix build bot error reported in ipvs set_mcast_loop()
Signed-off-by
inet: move inet->mc_loop to inet->inet_frags
IP_MULTICAST_LOOP socket option can now be set/read without locking the socket.
v3: fix build bot error reported in ipvs set_mcast_loop()
Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3 |
|
| #
f5f80e32 |
| 20-Jul-2023 |
Eric Dumazet <[email protected]> |
ipv6: remove hard coded limitation on ipv6_pinfo
IPv6 inet sockets are supposed to have a "struct ipv6_pinfo" field at the end of their definition, so that inet6_sk_generic() can derive from socket
ipv6: remove hard coded limitation on ipv6_pinfo
IPv6 inet sockets are supposed to have a "struct ipv6_pinfo" field at the end of their definition, so that inet6_sk_generic() can derive from socket size the offset of the "struct ipv6_pinfo".
This is very fragile, and prevents adding bigger alignment in sockets, because inet6_sk_generic() does not work if the compiler adds padding after the ipv6_pinfo component.
We are currently working on a patch series to reorganize TCP structures for better data locality and found issues similar to the one fixed in commit f5d547676ca0 ("tcp: fix tcp_inet6_sk() for 32bit kernels")
Alternative would be to force an alignment on "struct ipv6_pinfo", greater or equal to __alignof__(any ipv6 sock) to ensure there is no padding. This does not look great.
v2: fix typo in mptcp_proto_v6_init() (Paolo)
Signed-off-by: Eric Dumazet <[email protected]> Cc: Chao Wu <[email protected]> Cc: Wei Wang <[email protected]> Cc: Coco Li <[email protected]> Cc: YiFei Zhu <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc2, v6.5-rc1 |
|
| #
6feb37b3 |
| 27-Jun-2023 |
Chengfeng Ye <[email protected]> |
sctp: fix potential deadlock on &net->sctp.addr_wq_lock
As &net->sctp.addr_wq_lock is also acquired by the timer sctp_addr_wq_timeout_handler() in protocal.c, the same lock acquisition at sctp_auto_
sctp: fix potential deadlock on &net->sctp.addr_wq_lock
As &net->sctp.addr_wq_lock is also acquired by the timer sctp_addr_wq_timeout_handler() in protocal.c, the same lock acquisition at sctp_auto_asconf_init() seems should disable irq since it is called from sctp_accept() under process context.
Possible deadlock scenario: sctp_accept() -> sctp_sock_migrate() -> sctp_auto_asconf_init() -> spin_lock(&net->sctp.addr_wq_lock) <timer interrupt> -> sctp_addr_wq_timeout_handler() -> spin_lock_bh(&net->sctp.addr_wq_lock); (deadlock here)
This flaw was found using an experimental static analysis tool we are developing for irq-related deadlock.
The tentative patch fix the potential deadlock by spin_lock_bh().
Signed-off-by: Chengfeng Ye <[email protected]> Fixes: 34e5b0118685 ("sctp: delay auto_asconf init until binding the first addr") Acked-by: Xin Long <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.4, v6.4-rc7, v6.4-rc6 |
|
| #
e1d001fa |
| 09-Jun-2023 |
Breno Leitao <[email protected]> |
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl callback. This is not flexible, because it is hard to reuse these functions without passing userspace buffers.
Change the "struct proto" ioctls to avoid touching userspace memory and operate on kernel buffers, i.e., all protocol's ioctl callbacks is adapted to operate on a kernel memory other than on userspace (so, no more {put,get}_user() and friends being called in the ioctl callback).
This changes the "struct proto" ioctl format in the following way:
int (*ioctl)(struct sock *sk, int cmd, - unsigned long arg); + int *karg);
(Important to say that this patch does not touch the "struct proto_ops" protocols)
So, the "karg" argument, which is passed to the ioctl callback, is a pointer allocated to kernel space memory (inside a function wrapper). This buffer (karg) may contain input argument (copied from userspace in a prep function) and it might return a value/buffer, which is copied back to userspace if necessary. There is not one-size-fits-all format (that is I am using 'may' above), but basically, there are three type of ioctls:
1) Do not read from userspace, returns a result to userspace 2) Read an input parameter from userspace, and does not return anything to userspace 3) Read an input from userspace, and return a buffer to userspace.
The default case (1) (where no input parameter is given, and an "int" is returned to userspace) encompasses more than 90% of the cases, but there are two other exceptions. Here is a list of exceptions:
* Protocol RAW: * cmd = SIOCGETVIFCNT: * input and output = struct sioc_vif_req * cmd = SIOCGETSGCNT * input and output = struct sioc_sg_req * Explanation: for the SIOCGETVIFCNT case, userspace passes the input argument, which is struct sioc_vif_req. Then the callback populates the struct, which is copied back to userspace.
* Protocol RAW6: * cmd = SIOCGETMIFCNT_IN6 * input and output = struct sioc_mif_req6 * cmd = SIOCGETSGCNT_IN6 * input and output = struct sioc_sg_req6
* Protocol PHONET: * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE * input int (4 bytes) * Nothing is copied back to userspace.
For the exception cases, functions sock_sk_ioctl_inout() will copy the userspace input, and copy it back to kernel space.
The wrapper that prepare the buffer and put the buffer back to user is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now calls sk_ioctl(), which will handle all cases.
Signed-off-by: Breno Leitao <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: David Ahern <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2 |
|
| #
2598619e |
| 11-May-2023 |
Alexander Mikhalitsyn <[email protected]> |
sctp: add bpf_bypass_getsockopt proto callback
Implement ->bpf_bypass_getsockopt proto callback and filter out SCTP_SOCKOPT_PEELOFF, SCTP_SOCKOPT_PEELOFF_FLAGS and SCTP_SOCKOPT_CONNECTX3 socket opti
sctp: add bpf_bypass_getsockopt proto callback
Implement ->bpf_bypass_getsockopt proto callback and filter out SCTP_SOCKOPT_PEELOFF, SCTP_SOCKOPT_PEELOFF_FLAGS and SCTP_SOCKOPT_CONNECTX3 socket options from running eBPF hook on them.
SCTP_SOCKOPT_PEELOFF and SCTP_SOCKOPT_PEELOFF_FLAGS options do fd_install(), and if BPF_CGROUP_RUN_PROG_GETSOCKOPT hook returns an error after success of the original handler sctp_getsockopt(...), userspace will receive an error from getsockopt syscall and will be not aware that fd was successfully installed into a fdtable.
As pointed by Marcelo Ricardo Leitner it seems reasonable to skip bpf getsockopt hook for SCTP_SOCKOPT_CONNECTX3 sockopt too. Because internaly, it triggers connect() and if error is masked then userspace will be confused.
This patch was born as a result of discussion around a new SCM_PIDFD interface: https://lore.kernel.org/all/[email protected]/
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Cc: Daniel Borkmann <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Neil Horman <[email protected]> Cc: Marcelo Ricardo Leitner <[email protected]> Cc: Xin Long <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Suggested-by: Stanislav Fomichev <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]> Acked-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc1, v6.3, v6.3-rc7 |
|
| #
ab4f1e28 |
| 14-Apr-2023 |
Xin Long <[email protected]> |
sctp: add intl_capable and reconf_capable in ss peer_capable
There are two new peer capables have been added since sctp_diag was introduced into SCTP. When dumping the peer capables, these two new p
sctp: add intl_capable and reconf_capable in ss peer_capable
There are two new peer capables have been added since sctp_diag was introduced into SCTP. When dumping the peer capables, these two new peer capables should also be included. To not break the old capables, reconf_capable takes the old hostname_address bit, and intl_capable uses the higher available bit in sctpi_peer_capable.
Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
bd4b2818 |
| 14-Apr-2023 |
Xin Long <[email protected]> |
sctp: delete the obsolete code for the host name address param
In the latest RFC9260, the Host Name Address param has been deprecated. For INIT chunk:
Note 3: An INIT chunk MUST NOT contain the H
sctp: delete the obsolete code for the host name address param
In the latest RFC9260, the Host Name Address param has been deprecated. For INIT chunk:
Note 3: An INIT chunk MUST NOT contain the Host Name Address parameter. The receiver of an INIT chunk containing a Host Name Address parameter MUST send an ABORT chunk and MAY include an "Unresolvable Address" error cause.
For Supported Address Types:
The value indicating the Host Name Address parameter MUST NOT be used when sending this parameter and MUST be ignored when receiving this parameter.
Currently Linux SCTP doesn't really support Host Name Address param, but only saves some flag and print debug info, which actually won't even be triggered due to the verification in sctp_verify_param(). This patch is to delete those dead code.
Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc6, v6.3-rc5 |
|
| #
2584024b |
| 01-Apr-2023 |
Xin Long <[email protected]> |
sctp: check send stream number after wait_for_sndbuf
This patch fixes a corner case where the asoc out stream count may change after wait_for_sndbuf.
When the main thread in the client starts a con
sctp: check send stream number after wait_for_sndbuf
This patch fixes a corner case where the asoc out stream count may change after wait_for_sndbuf.
When the main thread in the client starts a connection, if its out stream count is set to N while the in stream count in the server is set to N - 2, another thread in the client keeps sending the msgs with stream number N - 1, and waits for sndbuf before processing INIT_ACK.
However, after processing INIT_ACK, the out stream count in the client is shrunk to N - 2, the same to the in stream count in the server. The crash occurs when the thread waiting for sndbuf is awake and sends the msg in a non-existing stream(N - 1), the call trace is as below:
KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f] Call Trace: <TASK> sctp_cmd_send_msg net/sctp/sm_sideeffect.c:1114 [inline] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1777 [inline] sctp_side_effects net/sctp/sm_sideeffect.c:1199 [inline] sctp_do_sm+0x197d/0x5310 net/sctp/sm_sideeffect.c:1170 sctp_primitive_SEND+0x9f/0xc0 net/sctp/primitive.c:163 sctp_sendmsg_to_asoc+0x10eb/0x1a30 net/sctp/socket.c:1868 sctp_sendmsg+0x8d4/0x1d90 net/sctp/socket.c:2026 inet_sendmsg+0x9d/0xe0 net/ipv4/af_inet.c:825 sock_sendmsg_nosec net/socket.c:722 [inline] sock_sendmsg+0xde/0x190 net/socket.c:745
The fix is to add an unlikely check for the send stream number after the thread wakes up from the wait_for_sndbuf.
Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: [email protected] Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6 |
|
| #
91d0b78c |
| 24-Jan-2023 |
Jakub Sitnicki <[email protected]> |
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires sta
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port.
An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing the socket.
This approach has a couple of downsides:
a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number.
Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used only if:
- there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack:
system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets.
Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing.
PORT_LO = 40_000 PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise.
[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
Reviewed-by: Marek Majkowski <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Jakub Sitnicki <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc5 |
|
| #
40e0b090 |
| 20-Jan-2023 |
Peilin Ye <[email protected]> |
net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example:
<...> iperf-609 [002] ..... 70.660425: s
net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example:
<...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...>
Suggested-by: Cong Wang <[email protected]> Signed-off-by: Peilin Ye <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6 |
|
| #
0af03170 |
| 16-Nov-2022 |
Xin Long <[email protected]> |
sctp: add dif and sdif check in asoc and ep lookup
This patch at first adds a pernet global l3mdev_accept to decide if it accepts the packets from a l3mdev when a SCTP socket doesn't bind to any int
sctp: add dif and sdif check in asoc and ep lookup
This patch at first adds a pernet global l3mdev_accept to decide if it accepts the packets from a l3mdev when a SCTP socket doesn't bind to any interface. It's set to 1 to avoid any possible incompatible issue, and in next patch, a sysctl will be introduced to allow to change it.
Then similar to inet/udp_sk_bound_dev_eq(), sctp_sk_bound_dev_eq() is added to check either dif or sdif is equal to sk_bound_dev_if, and to check sid is 0 or l3mdev_accept is 1 if sk_bound_dev_if is not set. This function is used to match a association or a endpoint, namely called by sctp_addrs_lookup_transport() and sctp_endpoint_is_match(). All functions that needs updating are:
sctp_rcv(): asoc: __sctp_rcv_lookup() __sctp_lookup_association() -> sctp_addrs_lookup_transport() __sctp_rcv_lookup_harder() __sctp_rcv_init_lookup() __sctp_lookup_association() -> sctp_addrs_lookup_transport() __sctp_rcv_walk_lookup() __sctp_rcv_asconf_lookup() __sctp_lookup_association() -> sctp_addrs_lookup_transport()
ep: __sctp_rcv_lookup_endpoint() -> sctp_endpoint_is_match()
sctp_connect(): sctp_endpoint_is_peeled_off() __sctp_lookup_association() sctp_has_association() sctp_lookup_association() __sctp_lookup_association() -> sctp_addrs_lookup_transport()
sctp_diag_dump_one(): sctp_transport_lookup_process() -> sctp_addrs_lookup_transport()
Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|