|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
c353e898 |
| 20-Mar-2025 |
Paolo Abeni <[email protected]> |
net: introduce per netns packet chains
Currently network taps unbound to any interface are linked in the global ptype_all list, affecting the performance in all the network namespaces.
Add per netn
net: introduce per netns packet chains
Currently network taps unbound to any interface are linked in the global ptype_all list, affecting the performance in all the network namespaces.
Add per netns ptypes chains, so that in the mentioned case only the netns owning the packet socket(s) is affected.
While at that drop the global ptype_all list: no in kernel user registers a tap on "any" type without specifying either the target device or the target namespace (and IMHO doing that would not make any sense).
Note that this adds a conditional in the fast path (to check for per netns ptype_specific list) and increases the dataset size by a cacheline (owing the per netns lists).
Reviewed-by: Sabrina Dubroca <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4 |
|
| #
e57a6320 |
| 17-Feb-2025 |
Kuniyuki Iwashima <[email protected]> |
net: Add net_passive_inc() and net_passive_dec().
net_drop_ns() is NULL when CONFIG_NET_NS is disabled.
The next patch introduces a function that increments and decrements net->passive.
As a prep,
net: Add net_passive_inc() and net_passive_dec().
net_drop_ns() is NULL when CONFIG_NET_NS is disabled.
The next patch introduces a function that increments and decrements net->passive.
As a prep, let's rename and export net_free() to net_passive_dec() and add net_passive_inc().
Suggested-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/netdev/CANn89i+oUCt2VGvrbrweniTendZFEh+nwS=uonc004-aPkWy-Q@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc3, v6.14-rc2 |
|
| #
482ad2a4 |
| 05-Feb-2025 |
Eric Dumazet <[email protected]> |
net: add dev_net_rcu() helper
dev->nd_net can change, readers should either use rcu_read_lock() or RTNL.
We currently use a generic helper, dev_net() with no debugging support. We probably have man
net: add dev_net_rcu() helper
dev->nd_net can change, readers should either use rcu_read_lock() or RTNL.
We currently use a generic helper, dev_net() with no debugging support. We probably have many hidden bugs.
Add dev_net_rcu() helper for callers using rcu_read_lock() protection.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13 |
|
| #
0734d7c3 |
| 14-Jan-2025 |
Eric Dumazet <[email protected]> |
net: expedite synchronize_net() for cleanup_net()
cleanup_net() is the single thread responsible for netns dismantles, and a serious bottleneck.
Before we can get per-netns RTNL, make sure all sync
net: expedite synchronize_net() for cleanup_net()
cleanup_net() is the single thread responsible for netns dismantles, and a serious bottleneck.
Before we can get per-netns RTNL, make sure all synchronize_net() called from this thread are using rcu_synchronize_expedited().
v3: deal with CONFIG_NET_NS=n
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jesse Brandeburg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2 |
|
| #
0f6ede9f |
| 04-Dec-2024 |
Eric Dumazet <[email protected]> |
net: defer final 'struct net' free in netns dismantle
Ilya reported a slab-use-after-free in dst_destroy [1]
Issue is in xfrm6_net_init() and xfrm4_net_init() :
They copy xfrm[46]_dst_ops_template
net: defer final 'struct net' free in netns dismantle
Ilya reported a slab-use-after-free in dst_destroy [1]
Issue is in xfrm6_net_init() and xfrm4_net_init() :
They copy xfrm[46]_dst_ops_template into net->xfrm.xfrm[46]_dst_ops.
But net structure might be freed before all the dst callbacks are called. So when dst_destroy() calls later :
if (dst->ops->destroy) dst->ops->destroy(dst);
dst->ops points to the old net->xfrm.xfrm[46]_dst_ops, which has been freed.
See a relevant issue fixed in :
ac888d58869b ("net: do not delay dst_entries_add() in dst_release()")
A fix is to queue the 'struct net' to be freed after one another cleanup_net() round (and existing rcu_barrier())
[1]
BUG: KASAN: slab-use-after-free in dst_destroy (net/core/dst.c:112) Read of size 8 at addr ffff8882137ccab0 by task swapper/37/0 Dec 03 05:46:18 kernel: CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Kdump: loaded Not tainted 6.12.0 #67 Hardware name: Red Hat KVM/RHEL, BIOS 1.16.1-1.el9 04/01/2014 Call Trace: <IRQ> dump_stack_lvl (lib/dump_stack.c:124) print_address_description.constprop.0 (mm/kasan/report.c:378) ? dst_destroy (net/core/dst.c:112) print_report (mm/kasan/report.c:489) ? dst_destroy (net/core/dst.c:112) ? kasan_addr_to_slab (mm/kasan/common.c:37) kasan_report (mm/kasan/report.c:603) ? dst_destroy (net/core/dst.c:112) ? rcu_do_batch (kernel/rcu/tree.c:2567) dst_destroy (net/core/dst.c:112) rcu_do_batch (kernel/rcu/tree.c:2567) ? __pfx_rcu_do_batch (kernel/rcu/tree.c:2491) ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4339 kernel/locking/lockdep.c:4406) rcu_core (kernel/rcu/tree.c:2825) handle_softirqs (kernel/softirq.c:554) __irq_exit_rcu (kernel/softirq.c:589 kernel/softirq.c:428 kernel/softirq.c:637) irq_exit_rcu (kernel/softirq.c:651) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) </IRQ> <TASK> asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:702) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:743) Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 6e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 0f 00 2d c7 c9 27 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 RSP: 0018:ffff888100d2fe00 EFLAGS: 00000246 RAX: 00000000001870ed RBX: 1ffff110201a5fc2 RCX: ffffffffb61a3e46 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffb3d4d123 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed11c7e1835d R10: ffff888e3f0c1aeb R11: 0000000000000000 R12: 0000000000000000 R13: ffff888100d20000 R14: dffffc0000000000 R15: 0000000000000000 ? ct_kernel_exit.constprop.0 (kernel/context_tracking.c:148) ? cpuidle_idle_call (kernel/sched/idle.c:186) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) cpuidle_idle_call (kernel/sched/idle.c:186) ? __pfx_cpuidle_idle_call (kernel/sched/idle.c:168) ? lock_release (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5848) ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4347 kernel/locking/lockdep.c:4406) ? tsc_verify_tsc_adjust (arch/x86/kernel/tsc_sync.c:59) do_idle (kernel/sched/idle.c:326) cpu_startup_entry (kernel/sched/idle.c:423 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:202 arch/x86/kernel/smpboot.c:282) ? __pfx_start_secondary (arch/x86/kernel/smpboot.c:232) ? soft_restart_cpu (arch/x86/kernel/head_64.S:452) common_startup_64 (arch/x86/kernel/head_64.S:414) </TASK> Dec 03 05:46:18 kernel: Allocated by task 12184: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (./arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345) kmem_cache_alloc_noprof (mm/slub.c:4085 mm/slub.c:4134 mm/slub.c:4141) copy_net_ns (net/core/net_namespace.c:421 net/core/net_namespace.c:480) create_new_namespaces (kernel/nsproxy.c:110) unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4)) ksys_unshare (kernel/fork.c:3313) __x64_sys_unshare (kernel/fork.c:3382) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Dec 03 05:46:18 kernel: Freed by task 11: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (./arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) kasan_save_free_info (mm/kasan/generic.c:582) __kasan_slab_free (mm/kasan/common.c:271) kmem_cache_free (mm/slub.c:4579 mm/slub.c:4681) cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:647) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) Dec 03 05:46:18 kernel: Last potentially related work creation: kasan_save_stack (mm/kasan/common.c:48) __kasan_record_aux_stack (mm/kasan/generic.c:541) insert_work (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 kernel/workqueue.c:788 kernel/workqueue.c:795 kernel/workqueue.c:2186) __queue_work (kernel/workqueue.c:2340) queue_work_on (kernel/workqueue.c:2391) xfrm_policy_insert (net/xfrm/xfrm_policy.c:1610) xfrm_add_policy (net/xfrm/xfrm_user.c:2116) xfrm_user_rcv_msg (net/xfrm/xfrm_user.c:3321) netlink_rcv_skb (net/netlink/af_netlink.c:2536) xfrm_netlink_rcv (net/xfrm/xfrm_user.c:3344) netlink_unicast (net/netlink/af_netlink.c:1316 net/netlink/af_netlink.c:1342) netlink_sendmsg (net/netlink/af_netlink.c:1886) sock_write_iter (net/socket.c:729 net/socket.c:744 net/socket.c:1165) vfs_write (fs/read_write.c:590 fs/read_write.c:683) ksys_write (fs/read_write.c:736) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Dec 03 05:46:18 kernel: Second to last potentially related work creation: kasan_save_stack (mm/kasan/common.c:48) __kasan_record_aux_stack (mm/kasan/generic.c:541) insert_work (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 kernel/workqueue.c:788 kernel/workqueue.c:795 kernel/workqueue.c:2186) __queue_work (kernel/workqueue.c:2340) queue_work_on (kernel/workqueue.c:2391) __xfrm_state_insert (./include/linux/workqueue.h:723 net/xfrm/xfrm_state.c:1150 net/xfrm/xfrm_state.c:1145 net/xfrm/xfrm_state.c:1513) xfrm_state_update (./include/linux/spinlock.h:396 net/xfrm/xfrm_state.c:1940) xfrm_add_sa (net/xfrm/xfrm_user.c:912) xfrm_user_rcv_msg (net/xfrm/xfrm_user.c:3321) netlink_rcv_skb (net/netlink/af_netlink.c:2536) xfrm_netlink_rcv (net/xfrm/xfrm_user.c:3344) netlink_unicast (net/netlink/af_netlink.c:1316 net/netlink/af_netlink.c:1342) netlink_sendmsg (net/netlink/af_netlink.c:1886) sock_write_iter (net/socket.c:729 net/socket.c:744 net/socket.c:1165) vfs_write (fs/read_write.c:590 fs/read_write.c:683) ksys_write (fs/read_write.c:736) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Fixes: a8a572a6b5f2 ("xfrm: dst_entries_init() per-net dst_ops") Reported-by: Ilya Maximets <[email protected]> Closes: https://lore.kernel.org/netdev/CANn89iKKYDVpB=MtmfH7nyv2p=rJWSLedO5k7wSZgtY_tO8WQg@mail.gmail.com/T/#m02c98c3009fe66382b73cfb4db9cf1df6fab3fbf Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Paolo Abeni <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
50b94204 |
| 03-Dec-2024 |
Paolo Abeni <[email protected]> |
ipmr: tune the ipmr_can_free_table() checks.
Eric reported a syzkaller-triggered splat caused by recent ipmr changes:
WARNING: CPU: 2 PID: 6041 at net/ipv6/ip6mr.c:419 ip6mr_free_table+0xbd/0x120 n
ipmr: tune the ipmr_can_free_table() checks.
Eric reported a syzkaller-triggered splat caused by recent ipmr changes:
WARNING: CPU: 2 PID: 6041 at net/ipv6/ip6mr.c:419 ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419 Modules linked in: CPU: 2 UID: 0 PID: 6041 Comm: syz-executor183 Not tainted 6.12.0-syzkaller-10681-g65ae975e97d5 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419 Code: 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 58 49 83 bc 24 c0 0e 00 00 00 74 09 e8 44 ef a9 f7 90 <0f> 0b 90 e8 3b ef a9 f7 48 8d 7b 38 e8 12 a3 96 f7 48 89 df be 0f RSP: 0018:ffffc90004267bd8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff88803c710000 RCX: ffffffff89e4d844 RDX: ffff88803c52c880 RSI: ffffffff89e4d87c RDI: ffff88803c578ec0 RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803c578000 R13: ffff88803c710000 R14: ffff88803c710008 R15: dead000000000100 FS: 00007f7a855ee6c0(0000) GS:ffff88806a800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7a85689938 CR3: 000000003c492000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ip6mr_rules_exit+0x176/0x2d0 net/ipv6/ip6mr.c:283 ip6mr_net_exit_batch+0x53/0xa0 net/ipv6/ip6mr.c:1388 ops_exit_list+0x128/0x180 net/core/net_namespace.c:177 setup_net+0x4fe/0x860 net/core/net_namespace.c:394 copy_net_ns+0x2b4/0x6b0 net/core/net_namespace.c:500 create_new_namespaces+0x3ea/0xad0 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 ksys_unshare+0x45d/0xa40 kernel/fork.c:3334 __do_sys_unshare kernel/fork.c:3405 [inline] __se_sys_unshare kernel/fork.c:3403 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3403 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f7a856332d9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f7a855ee238 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f7a856bd308 RCX: 00007f7a856332d9 RDX: 00007f7a8560f8c6 RSI: 0000000000000000 RDI: 0000000062040200 RBP: 00007f7a856bd300 R08: 00007fff932160a7 R09: 00007f7a855ee6c0 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a856bd30c R13: 0000000000000000 R14: 00007fff93215fc0 R15: 00007fff932160a8 </TASK>
The root cause is a network namespace creation failing after successful initialization of the ipmr subsystem. Such a case is not currently matched by the ipmr_can_free_table() helper.
New namespaces are zeroed on allocation and inserted into net ns list only after successful creation; when deleting an ipmr table, the list next pointer can be NULL only on netns initialization failure.
Update the ipmr_can_free_table() checks leveraging such condition.
Reported-by: Eric Dumazet <[email protected]> Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=6e8cb445d4b43d006e0c Fixes: 11b6e701bce9 ("ipmr: add debug check for mr table cleanup") Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/8bde975e21bbca9d9c27e36209b2dd4f1d7a3f00.1733212078.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
76aed953 |
| 04-Oct-2024 |
Kuniyuki Iwashima <[email protected]> |
rtnetlink: Add per-netns RTNL.
The goal is to break RTNL down into per-netns mutex.
This patch adds per-netns mutex and its helper functions, rtnl_net_lock() and rtnl_net_unlock().
rtnl_net_lock()
rtnetlink: Add per-netns RTNL.
The goal is to break RTNL down into per-netns mutex.
This patch adds per-netns mutex and its helper functions, rtnl_net_lock() and rtnl_net_unlock().
rtnl_net_lock() acquires the global RTNL and per-netns RTNL mutex, and rtnl_net_unlock() releases them.
We will replace 800+ rtnl_lock() with rtnl_net_lock() and finally removes rtnl_lock() in rtnl_net_lock().
When we need to nest per-netns RTNL mutex, we will use __rtnl_net_lock(), and its locking order is defined by rtnl_net_lock_cmp_fn() as follows:
1. init_net is first 2. netns address ascending order
Note that the conversion will be done under CONFIG_DEBUG_NET_SMALL_RTNL with LOCKDEP so that we can carefully add the extra mutex without slowing down RTNL operations during conversion.
Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
768e4bb6 |
| 31-Jul-2024 |
Kuniyuki Iwashima <[email protected]> |
net: Don't register pernet_operations if only one of id or size is specified.
We can allocate per-netns memory for struct pernet_operations by specifying id and size.
register_pernet_operations() a
net: Don't register pernet_operations if only one of id or size is specified.
We can allocate per-netns memory for struct pernet_operations by specifying id and size.
register_pernet_operations() assigns an id to pernet_operations and later ops_init() allocates the specified size of memory as net->gen->ptr[id].
If id is missing, no memory is allocated. If size is not specified, pernet_operations just wastes an entry of net->gen->ptr[] for every netns.
net_generic is available only when both id and size are specified, so let's ensure that.
While we are at it, we add const to both fields.
Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4 |
|
| #
fd4f101e |
| 06-Feb-2024 |
Eric Dumazet <[email protected]> |
net: add exit_batch_rtnl() method
Many (struct pernet_operations)->exit_batch() methods have to acquire rtnl.
In presence of rtnl mutex pressure, this makes cleanup_net() very slow.
This patch add
net: add exit_batch_rtnl() method
Many (struct pernet_operations)->exit_batch() methods have to acquire rtnl.
In presence of rtnl mutex pressure, this makes cleanup_net() very slow.
This patch adds a new exit_batch_rtnl() method to reduce number of rtnl acquisitions from cleanup_net().
exit_batch_rtnl() handlers are called while rtnl is locked, and devices to be killed can be queued in a list provided as their second argument.
A single unregister_netdevice_many() is called right before rtnl is released.
exit_batch_rtnl() handlers are called before ->exit() and ->exit_batch() handlers.
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Antoine Tenart <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc3 |
|
| #
ffabe98c |
| 02-Feb-2024 |
Eric Dumazet <[email protected]> |
net: make dev_unreg_count global
We can use a global dev_unreg_count counter instead of a per netns one.
As a bonus we can factorize the changes done on it for bulk device removals.
Signed-off-by:
net: make dev_unreg_count global
We can use a global dev_unreg_count counter instead of a per netns one.
As a bonus we can factorize the changes done on it for bulk device removals.
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6 |
|
| #
2034d90a |
| 13-Oct-2023 |
Jiri Pirko <[email protected]> |
net: treat possible_net_t net pointer as an RCU one and add read_pnet_rcu()
Make the net pointer stored in possible_net_t structure annotated as an RCU pointer. Change the access helpers to treat it
net: treat possible_net_t net pointer as an RCU one and add read_pnet_rcu()
Make the net pointer stored in possible_net_t structure annotated as an RCU pointer. Change the access helpers to treat it as such. Introduce read_pnet_rcu() helper to allow caller to dereference the net pointer under RCU read lock.
Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6 |
|
| #
e1b41e4f |
| 09-Aug-2023 |
Joel Granados <[email protected]> |
sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
Replace SIZE_MAX with ARRAY_SIZE in the register_net_sysctl macro. Now that all the callers to register_net_sysctl are actual arrays, we can call
sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
Replace SIZE_MAX with ARRAY_SIZE in the register_net_sysctl macro. Now that all the callers to register_net_sysctl are actual arrays, we can call ARRAY_SIZE() without any compilation warnings. By calculating the actual array size, this commit is making sure that register_net_sysctl and all its callers forward the table_size into sysctl backend for when the sentinel elements in the ctl_table arrays (last empty markers) are removed. Without it the removal would fail lacking a stopping criteria for traversing the ctl_table arrays.
Stopping condition continues to be based on both table size and the procname null test. This is needed in order to allow for the systematic removal al the sentinel element in subsequent commits: Before removing sentinel the stopping criteria will be the last null element. When the sentinel is removed then the (correct) size will take over.
Signed-off-by: Joel Granados <[email protected]> Suggested-by: Jani Nikula <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
show more ...
|
| #
95d49778 |
| 09-Aug-2023 |
Joel Granados <[email protected]> |
sysctl: Add size to register_net_sysctl function
This commit adds size to the register_net_sysctl indirection function to facilitate the removal of the sentinel elements (last empty markers) from th
sysctl: Add size to register_net_sysctl function
This commit adds size to the register_net_sysctl indirection function to facilitate the removal of the sentinel elements (last empty markers) from the ctl_table arrays. Though we don't actually remove any sentinels in this commit, register_net_sysctl* now has the capability of forwarding table_size for when that happens.
We create a new function register_net_sysctl_sz with an extra size argument. A macro replaces the existing register_net_sysctl. The size in the macro is SIZE_MAX instead of ARRAY_SIZE to avoid compilation errors while we systematically migrate to register_net_sysctl_sz. Will change to ARRAY_SIZE in subsequent commits.
Care is taken to add table_size to the stopping criteria in such a way that when we remove the empty sentinel element, it will continue stopping in the last element of the ctl_table array.
Signed-off-by: Joel Granados <[email protected]> Suggested-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc5, v6.5-rc4 |
|
| #
759ab1ed |
| 26-Jul-2023 |
Jakub Kicinski <[email protected]> |
net: store netdevs in an xarray
Iterating over the netdev hash table for netlink dumps is hard. Dumps are done in "chunks" so we need to save the position after each chunk, so we know where to resta
net: store netdevs in an xarray
Iterating over the netdev hash table for netlink dumps is hard. Dumps are done in "chunks" so we need to save the position after each chunk, so we know where to restart from. Because netdevs are stored in a hash table we remember which bucket we were in and how many devices we dumped.
Since we don't hold any locks across the "chunks" - devices may come and go while we're dumping. If that happens we may miss a device (if device is deleted from the bucket we were in). We indicate to user space that this may have happened by setting NLM_F_DUMP_INTR. User space is supposed to dump again (I think) if it sees that. Somehow I doubt most user space gets this right..
To illustrate let's look at an example:
System state: start: # [A, B, C] del: B # [A, C]
with the hash table we may dump [A, B], missing C completely even tho it existed both before and after the "del B".
Add an xarray and use it to allocate ifindexes. This way we can iterate ifindexes in order, without the worry that we'll skip one. We may still generate a dump of a state which "never existed", for example for a set of values and sequence of ops:
System state: start: # [A, B] add: C # [A, C, B] del: B # [A, C]
we may generate a dump of [A], if C got an index between A and B. System has never been in such state. But I'm 90% sure that's perfectly fine, important part is that we can't _miss_ devices which exist before and after. User space which wants to mirror kernel's state subscribes to notifications and does periodic dumps so it will know that C exists from the notification about its creation or from the next dump (next dump is _guaranteed_ to include C, if it doesn't get removed).
To avoid any perf regressions keep the hash table for now. Most net namespaces have very few devices and microbenchmarking 1M lookups on Skylake I get the following results (not counting loopback to number of devs):
#devs | hash | xa | delta 2 | 18.3 | 20.1 | + 9.8% 16 | 18.3 | 20.1 | + 9.5% 64 | 18.3 | 26.3 | +43.8% 128 | 20.4 | 26.3 | +28.6% 256 | 20.0 | 26.4 | +32.1% 1024 | 26.6 | 26.7 | + 0.2% 8192 |541.3 | 33.5 | -93.8%
No surprises since the hash table has 256 entries. The microbenchmark scans indexes in order, if the pattern is more random xa starts to win at 512 devices already. But that's a lot of devices, in practice.
Reviewed-by: Leon Romanovsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2 |
|
| #
0cafd77d |
| 20-Oct-2022 |
Eric Dumazet <[email protected]> |
net: add a refcount tracker for kernel sockets
Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock") added a tracker to sockets, but did not track kernel sockets.
We still have syz
net: add a refcount tracker for kernel sockets
Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock") added a tracker to sockets, but did not track kernel sockets.
We still have syzbot reports hinting about netns being destroyed while some kernel TCP sockets had not been dismantled.
This patch tracks kernel sockets, and adds a ref_tracker_dir_print() call to net_free() right before the netns is freed.
Normally, each layer is responsible for properly releasing its kernel sockets before last call to net_free().
This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Tested-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3 |
|
| #
b0381776 |
| 15-Jun-2022 |
Vlad Buslov <[email protected]> |
netfilter: nf_flow_table: count pending offload workqueue tasks
To improve hardware offload debuggability count pending 'add', 'del' and 'stats' flow_table offload workqueue tasks. Counters are incr
netfilter: nf_flow_table: count pending offload workqueue tasks
To improve hardware offload debuggability count pending 'add', 'del' and 'stats' flow_table offload workqueue tasks. Counters are incremented before scheduling new task and decremented when workqueue handler finishes executing. These counters allow user to diagnose congestion on hardware offload workqueues that can happen when either CPU is starved and workqueue jobs are executed at lower rate than new ones are added or when hardware/driver can't keep up with the rate.
Implement the described counters as percpu counters inside new struct netns_ft which is stored inside struct net. Expose them via new procfs file '/proc/net/stats/nf_flowtable' that is similar to existing 'nf_conntrack' file.
Signed-off-by: Vlad Buslov <[email protected]> Signed-off-by: Oz Shlomo <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
show more ...
|
| #
b6e81138 |
| 21-Jun-2022 |
Kuniyuki Iwashima <[email protected]> |
af_unix: Define a per-netns hash table.
This commit adds a per netns hash table for AF_UNIX, which size is fixed as UNIX_HASH_SIZE for now.
The first implementation defines a per-netns hash table a
af_unix: Define a per-netns hash table.
This commit adds a per netns hash table for AF_UNIX, which size is fixed as UNIX_HASH_SIZE for now.
The first implementation defines a per-netns hash table as a single array of lock and list:
struct unix_hashbucket { spinlock_t lock; struct hlist_head head; };
struct netns_unix { struct unix_hashbucket *hash; ... };
But, Eric pointed out memory cost that the structure has holes because of sizeof(spinlock_t), which is 4 (or more if LOCKDEP is enabled). [0] It could be expensive on a host with thousands of netns and few AF_UNIX sockets. For this reason, a per-netns hash table uses two dense arrays.
struct unix_table { spinlock_t *locks; struct hlist_head *buckets; };
struct netns_unix { struct unix_table table; ... };
Note the length of the list has a significant impact rather than lock contention, so having shared locks can be an option. But, per-netns locks and lists still perform better than the global locks and per-netns lists. [1]
Also, this patch adds a change so that struct netns_unix disappears from struct net if CONFIG_UNIX is disabled.
[0]: https://lore.kernel.org/netdev/CANn89iLVxO5aqx16azNU7p7Z-nz5NrnM5QTqOzueVxEnkVTxyg@mail.gmail.com/ [1]: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4 |
|
| #
ede6c39c |
| 10-Feb-2022 |
Eric Dumazet <[email protected]> |
net: make net->dev_unreg_count atomic
Having to acquire rtnl from netdev_run_todo() for every dismantled device is not desirable when/if rtnl is under stress.
Signed-off-by: Eric Dumazet <edumazet@
net: make net->dev_unreg_count atomic
Having to acquire rtnl from netdev_run_todo() for every dismantled device is not desirable when/if rtnl is under stress.
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.17-rc3 |
|
| #
9c1be193 |
| 05-Feb-2022 |
Eric Dumazet <[email protected]> |
net: initialize init_net earlier
While testing a patch that will follow later ("net: add netns refcount tracker to struct nsproxy") I found that devtmpfs_init() was called before init_net was initia
net: initialize init_net earlier
While testing a patch that will follow later ("net: add netns refcount tracker to struct nsproxy") I found that devtmpfs_init() was called before init_net was initialized.
This is a bug, because devtmpfs_setup() calls ksys_unshare(CLONE_NEWNS);
This has the effect of increasing init_net refcount, which will be later overwritten to 1, as part of setup_net(&init_net)
We had too many prior patches [1] trying to work around the root cause.
Really, make sure init_net is in BSS section, and that net_ns_init() is called earlier at boot time.
Note that another patch ("vfs: add netns refcount tracker to struct fs_context") also will need net_ns_init() being called before vfs_caches_init()
As a bonus, this patch saves around 4KB in .data section.
[1]
f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace") b5082df8019a ("net: Initialise init_net.count to 1") 734b65417b24 ("net: Statically initialize init_net.dev_base_head")
v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5 |
|
| #
9ba74e6c |
| 10-Dec-2021 |
Eric Dumazet <[email protected]> |
net: add networking namespace refcount tracker
We have 100+ syzbot reports about netns being dismantled too soon, still unresolved as of today.
We think a missing get_net() or an extra put_net() is
net: add networking namespace refcount tracker
We have 100+ syzbot reports about netns being dismantled too soon, still unresolved as of today.
We think a missing get_net() or an extra put_net() is the root cause.
In order to find the bug(s), and be able to spot future ones, this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers to precisely pair all put_net() with corresponding get_net().
To use these helpers, each data structure owning a refcount should also use a "netns_tracker" to pair the get and put.
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3 |
|
| #
f2e3778d |
| 22-Jul-2021 |
Florian Westphal <[email protected]> |
netfilter: remove xt pernet data
clusterip is now handled via net_generic.
NOTRACK is tiny compared to rest of xt_CT feature set, even the existing deprecation warning is bigger than the actual fun
netfilter: remove xt pernet data
clusterip is now handled via net_generic.
NOTRACK is tiny compared to rest of xt_CT feature set, even the existing deprecation warning is bigger than the actual functionality.
Just remove the warning, its not worth keeping/adding a net_generic one.
Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
show more ...
|
| #
889b7da2 |
| 29-Jul-2021 |
Jeremy Kerr <[email protected]> |
mctp: Add initial routing framework
Add a simple routing table, and a couple of route output handlers, and the mctp packet_type & handler.
Includes changes from Matt Johnston <[email protected]
mctp: Add initial routing framework
Add a simple routing table, and a couple of route output handlers, and the mctp packet_type & handler.
Includes changes from Matt Johnston <[email protected]>.
Signed-off-by: Jeremy Kerr <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7 |
|
| #
e34492de |
| 14-Jun-2021 |
Changbin Du <[email protected]> |
net: inline function get_net_ns_by_fd if NET_NS is disabled
The function get_net_ns_by_fd() could be inlined when NET_NS is not enabled.
Signed-off-by: Changbin Du <[email protected]> Signed-of
net: inline function get_net_ns_by_fd if NET_NS is disabled
The function get_net_ns_by_fd() could be inlined when NET_NS is not enabled.
Signed-off-by: Changbin Du <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.13-rc6 |
|
| #
ea6932d7 |
| 11-Jun-2021 |
Changbin Du <[email protected]> |
net: make get_net_ns return error if NET_NS is disabled
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled. The reason is that nsfs tries to access ns->ops but the proc_ns_ope
net: make get_net_ns return error if NET_NS is disabled
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled. The reason is that nsfs tries to access ns->ops but the proc_ns_operations is not implemented in this case.
[7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010 [7.670268] pgd = 32b54000 [7.670544] [00000010] *pgd=00000000 [7.671861] Internal error: Oops: 5 [#1] SMP ARM [7.672315] Modules linked in: [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16 [7.673309] Hardware name: Generic DT based system [7.673642] PC is at nsfs_evict+0x24/0x30 [7.674486] LR is at clear_inode+0x20/0x9c
The same to tun SIOCGSKNS command.
To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is disabled. Meanwhile move it to right place net/core/net_namespace.c.
Signed-off-by: Changbin Du <[email protected]> Fixes: c62cce2caee5 ("net: add an ioctl to get a socket network namespace") Cc: Cong Wang <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: David Laight <[email protected]> Cc: Christian Brauner <[email protected]> Suggested-by: Jakub Kicinski <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
194730a9 |
| 16-Jun-2021 |
Guvenc Gulce <[email protected]> |
net/smc: Make SMC statistics network namespace aware
Make the gathered SMC statistics network namespace aware, for each namespace collect an own set of statistic information.
Signed-off-by: Guvenc
net/smc: Make SMC statistics network namespace aware
Make the gathered SMC statistics network namespace aware, for each namespace collect an own set of statistic information.
Signed-off-by: Guvenc Gulce <[email protected]> Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|