|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
8fb5bb16 |
| 13-Feb-2025 |
Michal Luczaj <[email protected]> |
sockmap, vsock: For connectible sockets allow only connected
sockmap expects all vsocks to have a transport assigned, which is expressed in vsock_proto::psock_update_sk_prot(). However, there is an
sockmap, vsock: For connectible sockets allow only connected
sockmap expects all vsocks to have a transport assigned, which is expressed in vsock_proto::psock_update_sk_prot(). However, there is an edge case where an unconnected (connectible) socket may lose its previously assigned transport. This is handled with a NULL check in the vsock/BPF recv path.
Another design detail is that listening vsocks are not supposed to have any transport assigned at all. Which implies they are not supported by the sockmap. But this is complicated by the fact that a socket, before switching to TCP_LISTEN, may have had some transport assigned during a failed connect() attempt. Hence, we may end up with a listening vsock in a sockmap, which blows up quickly:
KASAN: null-ptr-deref in range [0x0000000000000120-0x0000000000000127] CPU: 7 UID: 0 PID: 56 Comm: kworker/7:0 Not tainted 6.14.0-rc1+ Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:vsock_read_skb+0x4b/0x90 Call Trace: sk_psock_verdict_data_ready+0xa4/0x2e0 virtio_transport_recv_pkt+0x1ca8/0x2acc vsock_loopback_work+0x27d/0x3f0 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x35a/0x700 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30
For connectible sockets, instead of relying solely on the state of vsk->transport, tell sockmap to only allow those representing established connections. This aligns with the behaviour for AF_INET and AF_UNIX.
Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <[email protected]> Acked-by: Stefano Garzarella <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2, v6.14-rc1 |
|
| #
5459cce6 |
| 22-Jan-2025 |
Jiayuan Chen <[email protected]> |
bpf: Disable non stream socket for strparser
Currently, only TCP supports strparser, but sockmap doesn't intercept non-TCP connections to attach strparser. For example, with UDP, although the read/w
bpf: Disable non stream socket for strparser
Currently, only TCP supports strparser, but sockmap doesn't intercept non-TCP connections to attach strparser. For example, with UDP, although the read/write handlers are replaced, strparser is not executed due to the lack of a read_sock operation.
Furthermore, in udp_bpf_recvmsg(), it checks whether the psock has data, and if not, it falls back to the native UDP read interface, making UDP + strparser appear to read correctly. According to its commit history, this behavior is unexpected.
Moreover, since UDP lacks the concept of streams, we intercept it directly.
Fixes: 1fa1fe8ff161 ("bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0") Signed-off-by: Jiayuan Chen <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
|
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2 |
|
| #
ed1fc5d7 |
| 02-Dec-2024 |
Michal Luczaj <[email protected]> |
bpf, sockmap: Fix race between element replace and close()
Element replace (with a socket different from the one stored) may race with socket's close() link popping & unlinking. __sock_map_delete()
bpf, sockmap: Fix race between element replace and close()
Element replace (with a socket different from the one stored) may race with socket's close() link popping & unlinking. __sock_map_delete() unconditionally unrefs the (wrong) element:
// set map[0] = s0 map_update_elem(map, 0, s0)
// drop fd of s0 close(s0) sock_map_close() lock_sock(sk) (s0!) sock_map_remove_links(sk) link = sk_psock_link_pop() sock_map_unlink(sk, link) sock_map_delete_from_link // replace map[0] with s1 map_update_elem(map, 0, s1) sock_map_update_elem (s1!) lock_sock(sk) sock_map_update_common psock = sk_psock(sk) spin_lock(&stab->lock) osk = stab->sks[idx] sock_map_add_link(..., &stab->sks[idx]) sock_map_unref(osk, &stab->sks[idx]) psock = sk_psock(osk) sk_psock_put(sk, psock) if (refcount_dec_and_test(&psock)) sk_psock_drop(sk, psock) spin_unlock(&stab->lock) unlock_sock(sk) __sock_map_delete spin_lock(&stab->lock) sk = *psk // s1 replaced s0; sk == s1 if (!sk_test || sk_test == sk) // sk_test (s0) != sk (s1); no branch sk = xchg(psk, NULL) if (sk) sock_map_unref(sk, psk) // unref s1; sks[idx] will dangle psock = sk_psock(sk) sk_psock_put(sk, psock) if (refcount_dec_and_test()) sk_psock_drop(sk, psock) spin_unlock(&stab->lock) release_sock(sk)
Then close(map) enqueues bpf_map_free_deferred, which finally calls sock_map_free(). This results in some refcount_t warnings along with a KASAN splat [1].
Fix __sock_map_delete(), do not allow sock_map_unref() on elements that may have been replaced.
[1]: BUG: KASAN: slab-use-after-free in sock_map_free+0x10e/0x330 Write of size 4 at addr ffff88811f5b9100 by task kworker/u64:12/1063
CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Not tainted 6.12.0+ #125 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 Workqueue: events_unbound bpf_map_free_deferred Call Trace: <TASK> dump_stack_lvl+0x68/0x90 print_report+0x174/0x4f6 kasan_report+0xb9/0x190 kasan_check_range+0x10f/0x1e0 sock_map_free+0x10e/0x330 bpf_map_free_deferred+0x173/0x320 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x29e/0x360 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 </TASK>
Allocated by task 1202: kasan_save_stack+0x1e/0x40 kasan_save_track+0x10/0x30 __kasan_slab_alloc+0x85/0x90 kmem_cache_alloc_noprof+0x131/0x450 sk_prot_alloc+0x5b/0x220 sk_alloc+0x2c/0x870 unix_create1+0x88/0x8a0 unix_create+0xc5/0x180 __sock_create+0x241/0x650 __sys_socketpair+0x1ce/0x420 __x64_sys_socketpair+0x92/0x100 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 46: kasan_save_stack+0x1e/0x40 kasan_save_track+0x10/0x30 kasan_save_free_info+0x37/0x60 __kasan_slab_free+0x4b/0x70 kmem_cache_free+0x1a1/0x590 __sk_destruct+0x388/0x5a0 sk_psock_destroy+0x73e/0xa50 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x29e/0x360 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30
The buggy address belongs to the object at ffff88811f5b9080 which belongs to the cache UNIX-STREAM of size 1984 The buggy address is located 128 bytes inside of freed 1984-byte region [ffff88811f5b9080, ffff88811f5b9840)
The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11f5b8 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 memcg:ffff888127d49401 flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) page_type: f5(slab) raw: 0017ffffc0000040 ffff8881042e4500 dead000000000122 0000000000000000 raw: 0000000000000000 00000000800f000f 00000001f5000000 ffff888127d49401 head: 0017ffffc0000040 ffff8881042e4500 dead000000000122 0000000000000000 head: 0000000000000000 00000000800f000f 00000001f5000000 ffff888127d49401 head: 0017ffffc0000003 ffffea00047d6e01 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff88811f5b9000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88811f5b9080: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88811f5b9180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88811f5b9200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Disabling lock debugging due to kernel taint
refcount_t: addition on 0; use-after-free. WARNING: CPU: 14 PID: 1063 at lib/refcount.c:25 refcount_warn_saturate+0xce/0x150 CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Tainted: G B 6.12.0+ #125 Tainted: [B]=BAD_PAGE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:refcount_warn_saturate+0xce/0x150 Code: 34 73 eb 03 01 e8 82 53 ad fe 0f 0b eb b1 80 3d 27 73 eb 03 00 75 a8 48 c7 c7 80 bd 95 84 c6 05 17 73 eb 03 01 e8 62 53 ad fe <0f> 0b eb 91 80 3d 06 73 eb 03 00 75 88 48 c7 c7 e0 bd 95 84 c6 05 RSP: 0018:ffff88815c49fc70 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff88811f5b9100 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001 RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed10bcde6349 R10: ffff8885e6f31a4b R11: 0000000000000000 R12: ffff88813be0b000 R13: ffff88811f5b9100 R14: ffff88811f5b9080 R15: ffff88813be0b024 FS: 0000000000000000(0000) GS:ffff8885e6f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055dda99b0250 CR3: 000000015dbac000 CR4: 0000000000752ef0 PKRU: 55555554 Call Trace: <TASK> ? __warn.cold+0x5f/0x1ff ? refcount_warn_saturate+0xce/0x150 ? report_bug+0x1ec/0x390 ? handle_bug+0x58/0x90 ? exc_invalid_op+0x13/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0xce/0x150 sock_map_free+0x2e5/0x330 bpf_map_free_deferred+0x173/0x320 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x29e/0x360 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> irq event stamp: 10741 hardirqs last enabled at (10741): [<ffffffff84400ec6>] asm_sysvec_apic_timer_interrupt+0x16/0x20 hardirqs last disabled at (10740): [<ffffffff811e532d>] handle_softirqs+0x60d/0x770 softirqs last enabled at (10506): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210 softirqs last disabled at (10301): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
refcount_t: underflow; use-after-free. WARNING: CPU: 14 PID: 1063 at lib/refcount.c:28 refcount_warn_saturate+0xee/0x150 CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Tainted: G B W 6.12.0+ #125 Tainted: [B]=BAD_PAGE, [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:refcount_warn_saturate+0xee/0x150 Code: 17 73 eb 03 01 e8 62 53 ad fe 0f 0b eb 91 80 3d 06 73 eb 03 00 75 88 48 c7 c7 e0 bd 95 84 c6 05 f6 72 eb 03 01 e8 42 53 ad fe <0f> 0b e9 6e ff ff ff 80 3d e6 72 eb 03 00 0f 85 61 ff ff ff 48 c7 RSP: 0018:ffff88815c49fc70 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff88811f5b9100 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001 RBP: 0000000000000003 R08: 0000000000000001 R09: ffffed10bcde6349 R10: ffff8885e6f31a4b R11: 0000000000000000 R12: ffff88813be0b000 R13: ffff88811f5b9100 R14: ffff88811f5b9080 R15: ffff88813be0b024 FS: 0000000000000000(0000) GS:ffff8885e6f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055dda99b0250 CR3: 000000015dbac000 CR4: 0000000000752ef0 PKRU: 55555554 Call Trace: <TASK> ? __warn.cold+0x5f/0x1ff ? refcount_warn_saturate+0xee/0x150 ? report_bug+0x1ec/0x390 ? handle_bug+0x58/0x90 ? exc_invalid_op+0x13/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0xee/0x150 sock_map_free+0x2d3/0x330 bpf_map_free_deferred+0x173/0x320 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x29e/0x360 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> irq event stamp: 10741 hardirqs last enabled at (10741): [<ffffffff84400ec6>] asm_sysvec_apic_timer_interrupt+0x16/0x20 hardirqs last disabled at (10740): [<ffffffff811e532d>] handle_softirqs+0x60d/0x770 softirqs last enabled at (10506): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210 softirqs last disabled at (10301): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Michal Luczaj <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
| #
75e072a3 |
| 02-Dec-2024 |
Michal Luczaj <[email protected]> |
bpf, sockmap: Fix update element with same
Consider a sockmap entry being updated with the same socket:
osk = stab->sks[idx]; sock_map_add_link(psock, link, map, &stab->sks[idx]); stab->sks[idx]
bpf, sockmap: Fix update element with same
Consider a sockmap entry being updated with the same socket:
osk = stab->sks[idx]; sock_map_add_link(psock, link, map, &stab->sks[idx]); stab->sks[idx] = sk; if (osk) sock_map_unref(osk, &stab->sks[idx]);
Due to sock_map_unref(), which invokes sock_map_del_link(), all the psock's links for stab->sks[idx] are torn:
list_for_each_entry_safe(link, tmp, &psock->link, list) { if (link->link_raw == link_raw) { ... list_del(&link->list); sk_psock_free_link(link); } }
And that includes the new link sock_map_add_link() added just before the unref.
This results in a sockmap holding a socket, but without the respective link. This in turn means that close(sock) won't trigger the cleanup, i.e. a closed socket will not be automatically removed from the sockmap.
Stop tearing the links when a matching link_raw is found.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Michal Luczaj <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5 |
|
| #
740be3b9 |
| 26-Oct-2024 |
Cong Wang <[email protected]> |
sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
The following race condition could trigger a NULL pointer dereference:
sock_map_link_detach(): sock_map_link_update_prog():
sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
The following race condition could trigger a NULL pointer dereference:
sock_map_link_detach(): sock_map_link_update_prog(): mutex_lock(&sockmap_mutex); ... sockmap_link->map = NULL; mutex_unlock(&sockmap_mutex); mutex_lock(&sockmap_mutex); ... sock_map_prog_link_lookup(sockmap_link->map); mutex_unlock(&sockmap_mutex); <continue>
Fix it by adding a NULL pointer check. In this specific case, it makes no sense to update a link which is being released.
Reported-by: Ruan Bonan <[email protected]> Fixes: 699c23f02c65 ("bpf: Add bpf_link support for sk_msg and sk_skb progs") Cc: Yonghong Song <[email protected]> Cc: John Fastabend <[email protected]> Cc: Jakub Sitnicki <[email protected]> Signed-off-by: Cong Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc4, v6.12-rc3 |
|
| #
9c5bd93e |
| 13-Oct-2024 |
Michal Luczaj <[email protected]> |
bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
Don't mislead the callers of bpf_{sk,msg}_redirect_{map,hash}(): make sure to immediately and visibly fail the forwarding of unsu
bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
Don't mislead the callers of bpf_{sk,msg}_redirect_{map,hash}(): make sure to immediately and visibly fail the forwarding of unsupported af_vsock packets.
Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7 |
|
| #
b1339be9 |
| 06-Sep-2024 |
Eric Dumazet <[email protected]> |
sock_map: Add a cond_resched() in sock_hash_free()
Several syzbot soft lockup reports all have in common sock_hash_free()
If a map with a large number of buckets is destroyed, we need to yield the
sock_map: Add a cond_resched() in sock_hash_free()
Several syzbot soft lockup reports all have in common sock_hash_free()
If a map with a large number of buckets is destroyed, we need to yield the cpu when needed.
Fixes: 75e68e5bf2c7 ("bpf, sockhash: Synchronize delete from bucket list on map free") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
55f32595 |
| 30-Jul-2024 |
Al Viro <[email protected]> |
bpf: switch maps to CLASS(fd, ...)
Calling conventions for __bpf_map_get() would be more convenient if it left fpdut() on failure to callers. Makes for simpler logics in the callers.
Amon
bpf: switch maps to CLASS(fd, ...)
Calling conventions for __bpf_map_get() would be more convenient if it left fpdut() on failure to callers. Makes for simpler logics in the callers.
Among other things, the proof of memory safety no longer has to rely upon file->private_data never being ERR_PTR(...) for bpffs files. Original calling conventions made it impossible for the caller to tell whether __bpf_map_get() has returned ERR_PTR(-EINVAL) because it has found the file not be a bpf map one (in which case it would've done fdput()) or because it found that ERR_PTR(-EINVAL) in file->private_data of a bpf map file (in which case fdput() would _not_ have been done).
Signed-off-by: Al Viro <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1 |
|
| #
4b4647ad |
| 24-May-2024 |
Thadeu Lima de Souza Cascardo <[email protected]> |
sock_map: avoid race between sock_map_close and sk_psock_put
sk_psock_get will return NULL if the refcount of psock has gone to 0, which will happen when the last call of sk_psock_put is done. Howev
sock_map: avoid race between sock_map_close and sk_psock_put
sk_psock_get will return NULL if the refcount of psock has gone to 0, which will happen when the last call of sk_psock_put is done. However, sk_psock_drop may not have finished yet, so the close callback will still point to sock_map_close despite psock being NULL.
This can be reproduced with a thread deleting an element from the sock map, while the second one creates a socket, adds it to the map and closes it.
That will trigger the WARN_ON_ONCE:
------------[ cut here ]------------ WARNING: CPU: 1 PID: 7220 at net/core/sock_map.c:1701 sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701 Modules linked in: CPU: 1 PID: 7220 Comm: syz-executor380 Not tainted 6.9.0-syzkaller-07726-g3c999d1ae3c7 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 RIP: 0010:sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701 Code: df e8 92 29 88 f8 48 8b 1b 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 79 29 88 f8 4c 8b 23 eb 89 e8 4f 15 23 f8 90 <0f> 0b 90 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d e9 13 26 3d 02 RSP: 0018:ffffc9000441fda8 EFLAGS: 00010293 RAX: ffffffff89731ae1 RBX: ffffffff94b87540 RCX: ffff888029470000 RDX: 0000000000000000 RSI: ffffffff8bcab5c0 RDI: ffffffff8c1faba0 RBP: 0000000000000000 R08: ffffffff92f9b61f R09: 1ffffffff25f36c3 R10: dffffc0000000000 R11: fffffbfff25f36c4 R12: ffffffff89731840 R13: ffff88804b587000 R14: ffff88804b587000 R15: ffffffff89731870 FS: 000055555e080380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000207d4000 CR4: 0000000000350ef0 Call Trace: <TASK> unix_release+0x87/0xc0 net/unix/af_unix.c:1048 __sock_release net/socket.c:659 [inline] sock_close+0xbe/0x240 net/socket.c:1421 __fput+0x42b/0x8a0 fs/file_table.c:422 __do_sys_close fs/open.c:1556 [inline] __se_sys_close fs/open.c:1541 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1541 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb37d618070 Code: 00 00 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d4 e8 10 2c 00 00 80 3d 31 f0 07 00 00 74 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c RSP: 002b:00007ffcd4a525d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000003 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb37d618070 RDX: 0000000000000010 RSI: 00000000200001c0 RDI: 0000000000000004 RBP: 0000000000000000 R08: 0000000100000000 R09: 0000000100000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK>
Use sk_psock, which will only check that the pointer is not been set to NULL yet, which should only happen after the callbacks are restored. If, then, a reference can still be gotten, we may call sk_psock_stop and cancel psock->work.
As suggested by Paolo Abeni, reorder the condition so the control flow is less convoluted.
After that change, the reproducer does not trigger the WARN_ON_ONCE anymore.
Suggested-by: Paolo Abeni <[email protected]> Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=07a2e4a1a57118ef7355 Fixes: aadb2bb83ff7 ("sock_map: Fix a potential use-after-free in sock_map_close()") Fixes: 5b4a79ba65a1 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself") Cc: [email protected] Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
3b9ce049 |
| 27-May-2024 |
Jakub Sitnicki <[email protected]> |
Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
This reverts commit ff91059932401894e6c86341915615c5eb0eca48.
This check is no longer needed. BPF programs attached to trac
Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
This reverts commit ff91059932401894e6c86341915615c5eb0eca48.
This check is no longer needed. BPF programs attached to tracepoints are now rejected by the verifier when they attempt to delete from a sockmap/sockhash maps.
Signed-off-by: Jakub Sitnicki <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4 |
|
| #
699c23f0 |
| 10-Apr-2024 |
Yonghong Song <[email protected]> |
bpf: Add bpf_link support for sk_msg and sk_skb progs
Add bpf_link support for sk_msg and sk_skb programs. We have an internal request to support bpf_link for sk_msg programs so user space can have
bpf: Add bpf_link support for sk_msg and sk_skb progs
Add bpf_link support for sk_msg and sk_skb programs. We have an internal request to support bpf_link for sk_msg programs so user space can have a uniform handling with bpf_link based libbpf APIs. Using bpf_link based libbpf API also has a benefit which makes system robust by decoupling prog life cycle and attachment life cycle.
Reviewed-by: John Fastabend <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc3 |
|
| #
ff910599 |
| 02-Apr-2024 |
Jakub Sitnicki <[email protected]> |
bpf, sockmap: Prevent lock inversion deadlock in map delete elem
syzkaller started using corpuses where a BPF tracing program deletes elements from a sockmap/sockhash map. Because BPF tracing progra
bpf, sockmap: Prevent lock inversion deadlock in map delete elem
syzkaller started using corpuses where a BPF tracing program deletes elements from a sockmap/sockhash map. Because BPF tracing programs can be invoked from any interrupt context, locks taken during a map_delete_elem operation must be hardirq-safe. Otherwise a deadlock due to lock inversion is possible, as reported by lockdep:
CPU0 CPU1 ---- ---- lock(&htab->buckets[i].lock); local_irq_disable(); lock(&host->lock); lock(&htab->buckets[i].lock); <Interrupt> lock(&host->lock);
Locks in sockmap are hardirq-unsafe by design. We expects elements to be deleted from sockmap/sockhash only in task (normal) context with interrupts enabled, or in softirq context.
Detect when map_delete_elem operation is invoked from a context which is _not_ hardirq-unsafe, that is interrupts are disabled, and bail out with an error.
Note that map updates are not affected by this issue. BPF verifier does not allow updating sockmap/sockhash from a BPF tracing program today.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: xingwei lee <[email protected]> Reported-by: yue sun <[email protected]> Reported-by: [email protected] Reported-by: [email protected] Signed-off-by: Jakub Sitnicki <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: [email protected] Acked-by: John Fastabend <[email protected]> Closes: https://syzkaller.appspot.com/bug?extid=d4066896495db380182e Closes: https://syzkaller.appspot.com/bug?extid=bc922f476bd65abbd466 Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4 |
|
| #
8d665064 |
| 01-Dec-2023 |
John Fastabend <[email protected]> |
bpf: syzkaller found null ptr deref in unix_bpf proto add
I added logic to track the sock pair for stream_unix sockets so that we ensure lifetime of the sock matches the time a sockmap could referen
bpf: syzkaller found null ptr deref in unix_bpf proto add
I added logic to track the sock pair for stream_unix sockets so that we ensure lifetime of the sock matches the time a sockmap could reference the sock (see fixes tag). I forgot though that we allow af_unix unconnected sockets into a sock{map|hash} map.
This is problematic because previous fixed expected sk_pair() to exist and did not NULL check it. Because unconnected sockets have a NULL sk_pair this resulted in the NULL ptr dereference found by syzkaller.
BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171 Write of size 4 at addr 0000000000000080 by task syz-executor360/5073 Call Trace: <TASK> ... sock_hold include/net/sock.h:777 [inline] unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171 sock_map_init_proto net/core/sock_map.c:190 [inline] sock_map_link+0xb87/0x1100 net/core/sock_map.c:294 sock_map_update_common+0xf6/0x870 net/core/sock_map.c:483 sock_map_update_elem_sys+0x5b6/0x640 net/core/sock_map.c:577 bpf_map_update_value+0x3af/0x820 kernel/bpf/syscall.c:167
We considered just checking for the null ptr and skipping taking a ref on the NULL peer sock. But, if the socket is then connected() after being added to the sockmap we can cause the original issue again. So instead this patch blocks adding af_unix sockets that are not in the ESTABLISHED state.
Reported-by: Eric Dumazet <[email protected]> Reported-by: [email protected] Fixes: 8866730aed51 ("bpf, sockmap: af_unix stream sockets need to hold ref for pair sock") Acked-by: Jakub Sitnicki <[email protected]> Signed-off-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3 |
|
| #
b80e31ba |
| 20-Sep-2023 |
Jakub Sitnicki <[email protected]> |
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
With a SOCKMAP/SOCKHASH map and an sk_msg program user can steer messages sent from one TCP socket (s1) to actually egress from anothe
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
With a SOCKMAP/SOCKHASH map and an sk_msg program user can steer messages sent from one TCP socket (s1) to actually egress from another TCP socket (s2):
tcp_bpf_sendmsg(s1) // = sk_prot->sendmsg tcp_bpf_send_verdict(s1) // __SK_REDIRECT case tcp_bpf_sendmsg_redir(s2) tcp_bpf_push_locked(s2) tcp_bpf_push(s2) tcp_rate_check_app_limited(s2) // expects tcp_sock tcp_sendmsg_locked(s2) // ditto
There is a hard-coded assumption in the call-chain, that the egress socket (s2) is a TCP socket.
However in commit 122e6c79efe1 ("sock_map: Update sock type checks for UDP") we have enabled redirects to non-TCP sockets. This was done for the sake of BPF sk_skb programs. There was no indention to support sk_msg send-to-egress use case.
As a result, attempts to send-to-egress through a non-TCP socket lead to a crash due to invalid downcast from sock to tcp_sock:
BUG: kernel NULL pointer dereference, address: 000000000000002f ... Call Trace: <TASK> ? show_regs+0x60/0x70 ? __die+0x1f/0x70 ? page_fault_oops+0x80/0x160 ? do_user_addr_fault+0x2d7/0x800 ? rcu_is_watching+0x11/0x50 ? exc_page_fault+0x70/0x1c0 ? asm_exc_page_fault+0x27/0x30 ? tcp_tso_segs+0x14/0xa0 tcp_write_xmit+0x67/0xce0 __tcp_push_pending_frames+0x32/0xf0 tcp_push+0x107/0x140 tcp_sendmsg_locked+0x99f/0xbb0 tcp_bpf_push+0x19d/0x3a0 tcp_bpf_sendmsg_redir+0x55/0xd0 tcp_bpf_send_verdict+0x407/0x550 tcp_bpf_sendmsg+0x1a1/0x390 inet_sendmsg+0x6a/0x70 sock_sendmsg+0x9d/0xc0 ? sockfd_lookup_light+0x12/0x80 __sys_sendto+0x10e/0x160 ? syscall_enter_from_user_mode+0x20/0x60 ? __this_cpu_preempt_check+0x13/0x20 ? lockdep_hardirqs_on+0x82/0x110 __x64_sys_sendto+0x1f/0x30 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reject selecting a non-TCP sockets as redirect target from a BPF sk_msg program to prevent the crash. When attempted, user will receive an EACCES error from send/sendto/sendmsg() syscall.
Fixes: 122e6c79efe1 ("sock_map: Update sock type checks for UDP") Signed-off-by: Jakub Sitnicki <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.6-rc2, v6.6-rc1 |
|
| #
35d2b7ff |
| 30-Aug-2023 |
John Fastabend <[email protected]> |
bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t
Sockmap and sockhash maps are a collection of psocks that are objects representing a socket plus a set of metadata needed to manage the
bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t
Sockmap and sockhash maps are a collection of psocks that are objects representing a socket plus a set of metadata needed to manage the BPF programs associated with the socket. These maps use the stab->lock to protect from concurrent operations on the maps, e.g. trying to insert to objects into the array at the same time in the same slot. Additionally, a sockhash map has a bucket lock to protect iteration and insert/delete into the hash entry.
Each psock has a psock->link which is a linked list of all the maps that a psock is attached to. This allows a psock (socket) to be included in multiple sockmap and sockhash maps. This linked list is protected the psock->link_lock.
They _must_ be nested correctly to avoid deadlock:
lock(stab->lock) : do BPF map operations and psock insert/delete lock(psock->link_lock) : add map to psock linked list of maps unlock(psock->link_lock) unlock(stab->lock)
For non PREEMPT_RT kernels both raw_spin_lock_t and spin_lock_t are guaranteed to not sleep. But, with PREEMPT_RT kernels the spin_lock_t variants may sleep. In the current code we have many patterns like this:
rcu_critical_section: raw_spin_lock(stab->lock) spin_lock(psock->link_lock) <- may sleep ouch spin_unlock(psock->link_lock) raw_spin_unlock(stab->lock) rcu_critical_section
Nesting spin_lock() inside a raw_spin_lock() violates locking rules for PREEMPT_RT kernels. And additionally we do alloc(GFP_ATOMICS) inside the stab->lock, but those might sleep on PREEMPT_RT kernels. The result is splats like this:
./test_progs -t sockmap_basic [ 33.344330] bpf_testmod: loading out-of-tree module taints kernel. [ 33.441933] [ 33.442089] ============================= [ 33.442421] [ BUG: Invalid wait context ] [ 33.442763] 6.5.0-rc5-01731-gec0ded2e0282 #4958 Tainted: G O [ 33.443320] ----------------------------- [ 33.443624] test_progs/2073 is trying to lock: [ 33.443960] ffff888102a1c290 (&psock->link_lock){....}-{3:3}, at: sock_map_update_common+0x2c2/0x3d0 [ 33.444636] other info that might help us debug this: [ 33.444991] context-{5:5} [ 33.445183] 3 locks held by test_progs/2073: [ 33.445498] #0: ffff88811a208d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: sock_map_update_elem_sys+0xff/0x330 [ 33.446159] #1: ffffffff842539e0 (rcu_read_lock){....}-{1:3}, at: sock_map_update_elem_sys+0xf5/0x330 [ 33.446809] #2: ffff88810d687240 (&stab->lock){+...}-{2:2}, at: sock_map_update_common+0x177/0x3d0 [ 33.447445] stack backtrace: [ 33.447655] CPU: 10 PID
To fix observe we can't readily remove the allocations (for that we would need to use/create something similar to bpf_map_alloc). So convert raw_spin_lock_t to spin_lock_t. We note that sock_map_update that would trigger the allocate and potential sleep is only allowed through sys_bpf ops and via sock_ops which precludes hw interrupts and low level atomic sections in RT preempt kernel. On non RT preempt kernel there are no changes here and spin locks sections and alloc(GFP_ATOMIC) are still not sleepable.
Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5 |
|
| #
7e96ec0e |
| 04-Aug-2023 |
Xu Kuohai <[email protected]> |
bpf, sockmap: Fix map type error in sock_map_del_link
sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although both types have member named "progs", the offset of "progs" member in these
bpf, sockmap: Fix map type error in sock_map_del_link
sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although both types have member named "progs", the offset of "progs" member in these two types is different, so "progs" should be accessed with the real map type.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Xu Kuohai <[email protected]> Reviewed-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc4 |
|
| #
13d2618b |
| 28-Jul-2023 |
Tomas Glozar <[email protected]> |
bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire
Disabling preemption in sock_map_sk_acquire conflicts with GFP_ATOMIC allocation later in sk_psock_init_link on PREEMPT_RT kernels, since
bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire
Disabling preemption in sock_map_sk_acquire conflicts with GFP_ATOMIC allocation later in sk_psock_init_link on PREEMPT_RT kernels, since GFP_ATOMIC might sleep on RT (see bpf: Make BPF and PREEMPT_RT co-exist patchset notes for details).
This causes calling bpf_map_update_elem on BPF_MAP_TYPE_SOCKMAP maps to BUG (sleeping function called from invalid context) on RT kernels.
preempt_disable was introduced together with lock_sk and rcu_read_lock in commit 99ba2b5aba24e ("bpf: sockhash, disallow bpf_tcp_close and update in parallel"), probably to match disabled migration of BPF programs, and is no longer necessary.
Remove preempt_disable to fix BUG in sock_map_update_common on RT.
Signed-off-by: Tomas Glozar <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Fixes: 99ba2b5aba24 ("bpf: sockhash, disallow bpf_tcp_close and update in parallel") Reviewed-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7 |
|
| #
6c3eba1c |
| 13-Jun-2023 |
Andrii Nakryiko <[email protected]> |
bpf: Centralize permissions checks for all BPF map types
This allows to do more centralized decisions later on, and generally makes it very explicit which maps are privileged and which are not (e.g.
bpf: Centralize permissions checks for all BPF map types
This allows to do more centralized decisions later on, and generally makes it very explicit which maps are privileged and which are not (e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants, as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit and easy to verify).
Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.4-rc6, v6.4-rc5, v6.4-rc4 |
|
| #
29173d07 |
| 23-May-2023 |
John Fastabend <[email protected]> |
bpf, sockmap: Convert schedule_work into delayed_work
Sk_buffs are fed into sockmap verdict programs either from a strparser (when the user might want to decide how framing of skb is done by attachi
bpf, sockmap: Convert schedule_work into delayed_work
Sk_buffs are fed into sockmap verdict programs either from a strparser (when the user might want to decide how framing of skb is done by attaching another parser program) or directly through tcp_read_sock. The tcp_read_sock is the preferred method for performance when the BPF logic is a stream parser.
The flow for Cilium's common use case with a stream parser is,
tcp_read_sock() sk_psock_verdict_recv ret = bpf_prog_run_pin_on_cpu() sk_psock_verdict_apply(sock, skb, ret) // if system is under memory pressure or app is slow we may // need to queue skb. Do this queuing through ingress_skb and // then kick timer to wake up handler skb_queue_tail(ingress_skb, skb) schedule_work(work);
The work queue is wired up to sk_psock_backlog(). This will then walk the ingress_skb skb list that holds our sk_buffs that could not be handled, but should be OK to run at some later point. However, its possible that the workqueue doing this work still hits an error when sending the skb. When this happens the skbuff is requeued on a temporary 'state' struct kept with the workqueue. This is necessary because its possible to partially send an skbuff before hitting an error and we need to know how and where to restart when the workqueue runs next.
Now for the trouble, we don't rekick the workqueue. This can cause a stall where the skbuff we just cached on the state variable might never be sent. This happens when its the last packet in a flow and no further packets come along that would cause the system to kick the workqueue from that side.
To fix we could do simple schedule_work(), but while under memory pressure it makes sense to back off some instead of continue to retry repeatedly. So instead to fix convert schedule_work to schedule_delayed_work and add backoff logic to reschedule from backlog queue on errors. Its not obvious though what a good backoff is so use '1'.
To test we observed some flakes whil running NGINX compliance test with sockmap we attributed these failed test to this bug and subsequent issue.
>From on list discussion. This commit
bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")
was intended to address similar race, but had a couple cases it missed. Most obvious it only accounted for receiving traffic on the local socket so if redirecting into another socket we could still get an sk_buff stuck here. Next it missed the case where copied=0 in the recv() handler and then we wouldn't kick the scheduler. Also its sub-optimal to require userspace to kick the internal mechanisms of sockmap to wake it up and copy data to user. It results in an extra syscall and requires the app to actual handle the EAGAIN correctly.
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: William Findlay <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|
|
Revision tags: v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7 |
|
| #
8c5c2a48 |
| 13-Apr-2023 |
Daniel Borkmann <[email protected]> |
bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap
syzbot reported a splat and bisected it to recent commit ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap"):
bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap
syzbot reported a splat and bisected it to recent commit ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap"):
[...] WARNING: CPU: 1 PID: 9280 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 Modules linked in: CPU: 1 PID: 9280 Comm: syz-executor.1 Not tainted 6.2.0-syzkaller-13249-gd319f344561d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 [...] Call Trace: <TASK> spin_unlock_bh include/linux/spinlock.h:395 [inline] sock_map_del_link+0x2ea/0x510 net/core/sock_map.c:165 sock_map_unref+0xb0/0x1d0 net/core/sock_map.c:184 sock_hash_delete_elem+0x1ec/0x2a0 net/core/sock_map.c:945 map_delete_elem kernel/bpf/syscall.c:1536 [inline] __sys_bpf+0x2edc/0x53e0 kernel/bpf/syscall.c:5053 __do_sys_bpf kernel/bpf/syscall.c:5166 [inline] __se_sys_bpf kernel/bpf/syscall.c:5164 [inline] __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5164 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe8f7c8c169 </TASK> [...]
Revert for now until we have a proper solution.
Fixes: ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap") Reported-by: [email protected] Cc: Hsin-Wei Hung <[email protected]> Cc: Xin Liu <[email protected]> Cc: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]/
show more ...
|
|
Revision tags: v6.3-rc6 |
|
| #
ed17aa92 |
| 06-Apr-2023 |
Xin Liu <[email protected]> |
bpf, sockmap: fix deadlocks in the sockhash and sockmap
When huang uses sched_switch tracepoint, the tracepoint does only one thing in the mounted ebpf program, which deletes the fixed elements in s
bpf, sockmap: fix deadlocks in the sockhash and sockmap
When huang uses sched_switch tracepoint, the tracepoint does only one thing in the mounted ebpf program, which deletes the fixed elements in sockhash ([0])
It seems that elements in sockhash are rarely actively deleted by users or ebpf program. Therefore, we do not pay much attention to their deletion. Compared with hash maps, sockhash only provides spin_lock_bh protection. This causes it to appear to have self-locking behavior in the interrupt context.
[0]:https://lore.kernel.org/all/CABcoxUayum5oOqFMMqAeWuS8+EzojquSOSyDA3J_2omY=2EeAg@mail.gmail.com/
Reported-by: Hsin-Wei Hung <[email protected]> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Xin Liu <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc5, v6.3-rc4 |
|
| #
d7ba4cc9 |
| 22-Mar-2023 |
JP Kobryn <[email protected]> |
bpf: return long from bpf_map_ops funcs
This patch changes the return types of bpf_map_ops functions to long, where previously int was returned. Using long allows for bpf programs to maintain the si
bpf: return long from bpf_map_ops funcs
This patch changes the return types of bpf_map_ops functions to long, where previously int was returned. Using long allows for bpf programs to maintain the sign bit in the absence of sign extension during situations where inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative error is returned.
The definitions of the helper funcs are generated from comments in the bpf uapi header at `include/uapi/linux/bpf.h`. The return type of these helpers was previously changed from int to long in commit bdb7b79b4ce8. For any case where one of the map helpers call the bpf_map_ops funcs that are still returning 32-bit int, a compiler might not include sign extension instructions to properly convert the 32-bit negative value a 64-bit negative value.
For example: bpf assembly excerpt of an inlined helper calling a kernel function and checking for a specific error:
; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST); ... 46: call 0xffffffffe103291c ; htab_map_update_elem ; if (err && err != -EEXIST) { 4b: cmp $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
kernel function assembly excerpt of return value from `htab_map_update_elem` returning 32-bit int:
movl $0xffffffef, %r9d ... movl %r9d, %eax
...results in the comparison: cmp $0xffffffffffffffef, $0x00000000ffffffef
Fixes: bdb7b79b4ce8 ("bpf: Switch most helper return values from 32-bit int to 64-bit long") Tested-by: Eduard Zingerman <[email protected]> Signed-off-by: JP Kobryn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc3, v6.3-rc2, v6.3-rc1 |
|
| #
73d2c619 |
| 05-Mar-2023 |
Yafang Shao <[email protected]> |
bpf, net: sock_map memory usage
sockmap and sockhash don't have something in common in allocation, so let's introduce different helpers to calculate their memory usage.
The reuslt as follows,
- be
bpf, net: sock_map memory usage
sockmap and sockhash don't have something in common in allocation, so let's introduce different helpers to calculate their memory usage.
The reuslt as follows,
- before 28: sockmap name count_map flags 0x0 key 4B value 4B max_entries 65536 memlock 524288B 29: sockhash name count_map flags 0x0 key 4B value 4B max_entries 65536 memlock 524288B
- after 28: sockmap name count_map flags 0x0 key 4B value 4B max_entries 65536 memlock 524608B 29: sockhash name count_map flags 0x0 <<<< no updated elements key 4B value 4B max_entries 65536 memlock 1048896B
Signed-off-by: Yafang Shao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5 |
|
| #
5b4a79ba |
| 21-Jan-2023 |
Jakub Sitnicki <[email protected]> |
bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
sock_map proto callbacks should never call themselves by design. Protect against bugs like [1] and break out of the recursive loop
bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
sock_map proto callbacks should never call themselves by design. Protect against bugs like [1] and break out of the recursive loop to avoid a stack overflow in favor of a resource leak.
[1] https://lore.kernel.org/all/[email protected]/
Suggested-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8 |
|
| #
0a182f8d |
| 02-Dec-2022 |
Eric Dumazet <[email protected]> |
bpf, sockmap: fix race in sock_map_free()
sock_map_free() calls release_sock(sk) without owning a reference on the socket. This can cause use-after-free as syzbot found [1]
Jakub Sitnicki already t
bpf, sockmap: fix race in sock_map_free()
sock_map_free() calls release_sock(sk) without owning a reference on the socket. This can cause use-after-free as syzbot found [1]
Jakub Sitnicki already took care of a similar issue in sock_hash_free() in commit 75e68e5bf2c7 ("bpf, sockhash: Synchronize delete from bucket list on map free")
[1] refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 3785 at lib/refcount.c:31 refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31 Modules linked in: CPU: 0 PID: 3785 Comm: kworker/u4:6 Not tainted 6.1.0-rc7-syzkaller-00103-gef4d3ea40565 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31 Code: 68 8b 31 c0 e8 75 71 15 fd 0f 0b e9 64 ff ff ff e8 d9 6e 4e fd c6 05 62 9c 3d 0a 01 48 c7 c7 80 bb 68 8b 31 c0 e8 54 71 15 fd <0f> 0b e9 43 ff ff ff 89 d9 80 e1 07 80 c1 03 38 c1 0f 8c a2 fe ff RSP: 0018:ffffc9000456fb60 EFLAGS: 00010246 RAX: eae59bab72dcd700 RBX: 0000000000000004 RCX: ffff8880207057c0 RDX: 0000000000000000 RSI: 0000000000000201 RDI: 0000000000000000 RBP: 0000000000000004 R08: ffffffff816fdabd R09: fffff520008adee5 R10: fffff520008adee5 R11: 1ffff920008adee4 R12: 0000000000000004 R13: dffffc0000000000 R14: ffff88807b1c6c00 R15: 1ffff1100f638dcf FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b30c30000 CR3: 000000000d08e000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_dec include/linux/refcount.h:344 [inline] refcount_dec include/linux/refcount.h:359 [inline] __sock_put include/net/sock.h:779 [inline] tcp_release_cb+0x2d0/0x360 net/ipv4/tcp_output.c:1092 release_sock+0xaf/0x1c0 net/core/sock.c:3468 sock_map_free+0x219/0x2c0 net/core/sock_map.c:356 process_one_work+0x81c/0xd10 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK>
Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Cc: Jakub Sitnicki <[email protected]> Cc: John Fastabend <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Song Liu <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|