|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
4b82b181 |
| 24-Feb-2025 |
Yonghong Song <[email protected]> |
bpf: Allow pre-ordering for bpf cgroup progs
Currently for bpf progs in a cgroup hierarchy, the effective prog array is computed from the bottom cgroup to upper cgroups (post-ordering). For example, in the following cgroup hierarchy

    root cgroup: p1, p2
    subcgroup:   p3, p4

with BPF_F_ALLOW_MULTI set at both cgroup levels, the effective prog array ordering looks like

    p3 p4 p1 p2

and at run time, progs will execute in that order.
But in some cases it is desirable to have a root prog execute earlier than children progs (pre-ordering). For example:

- prog p1 intends to collect original pkt dest addresses.
- prog p3 will modify original pkt dest addresses to a proxy address for security reasons.

The end result is that prog p1 gets the proxy address, which is not what it wants. Putting p1 in every child cgroup is not desirable either, as it will duplicate itself across many child cgroups. This is exactly a use case we are encountering at Meta.
To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag is specified at attachment time, the prog has higher priority, and the ordering of progs with that flag will be from top to bottom (pre-ordering). Taking the above example again,

    root cgroup: p1, p2
    subcgroup:   p3, p4

let us say p2 and p4 are marked with BPF_F_PREORDER. The final effective array ordering will be

    p2 p4 p3 p1
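As an illustration (not part of the original patch), attaching a prog with the new flag from userspace might look like the following sketch, assuming the libbpf opts-based attach API; prog_fd, cgroup_fd and the attach type are placeholders:

    LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                .flags = BPF_F_ALLOW_MULTI | BPF_F_PREORDER);

    /* this prog now runs before non-preorder progs attached
     * in child cgroups */
    err = bpf_prog_attach_opts(prog_fd, cgroup_fd,
                               BPF_CGROUP_INET_EGRESS, &opts);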
Suggested-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
78e2baf3 |
| 10-Oct-2024 |
Eric Dumazet <[email protected]> |
net: add TIME_WAIT logic to sk_to_full_sk()
TCP will soon attach TIME_WAIT sockets to some ACK and RST.
Make sure sk_to_full_sk() detects this and does not return a non-full socket.
v3: also changed sk_const_to_full_sk()
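A sketch of the updated helper under that description (hand-written for illustration, not copied from the patch):

    static inline struct sock *sk_to_full_sk(struct sock *sk)
    {
    #ifdef CONFIG_INET
            if (sk && sk->sk_state == TCP_NEW_SYN_RECV)
                    sk = inet_reqsk(sk)->rsk_listener;
            if (sk && sk->sk_state == TCP_TIME_WAIT)
                    sk = NULL;      /* a timewait sock is never a full sock */
    #endif
            return sk;
    }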
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Martin KaFai Lau <[email protected]> Reviewed-by: Brian Vazquez <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3 |
|
| #
57e3707e |
| 05-Aug-2024 |
Thomas Weißschuh <[email protected]> |
bpf: Constify ctl_table argument of filter function
The sysctl core is moving to allow "struct ctl_table" in read-only memory. As a preparation for that all functions handling "struct ctl_table" need to be able to work with "const struct ctl_table". As __cgroup_bpf_run_filter_sysctl() does not modify its table, it can be adapted trivially.
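The change boils down to one qualifier in the filter function's signature, roughly (sketch; the remaining parameters are elided):

    /* before */
    int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
                                       struct ctl_table *table, int write, ...);

    /* after */
    int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
                                       const struct ctl_table *table, int write, ...);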
Signed-off-by: Thomas Weißschuh <[email protected]> Signed-off-by: Joel Granados <[email protected]>
|
| #
33f339a1 |
| 30-Aug-2024 |
Tze-nan Wu <[email protected]> |
bpf, net: Fix a potential race in do_sock_getsockopt()
There's a potential race when `cgroup_bpf_enabled(CGROUP_GETSOCKOPT)` is false during the execution of `BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN`, but becomes true when `BPF_CGROUP_RUN_PROG_GETSOCKOPT` is called. This inconsistency can lead to `BPF_CGROUP_RUN_PROG_GETSOCKOPT` receiving an "-EFAULT" from `__cgroup_bpf_run_filter_getsockopt(max_optlen=0)`. Scenario shown as below:
    `process A`                                 `process B`
    -----------                                 -----------
    BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN
                                                enable CGROUP_GETSOCKOPT
    BPF_CGROUP_RUN_PROG_GETSOCKOPT (-EFAULT)
To resolve this, remove the `BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN` macro and directly use `copy_from_sockptr` to ensure that `max_optlen` is always set before `BPF_CGROUP_RUN_PROG_GETSOCKOPT` is invoked.
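A sketch of the resulting pattern in do_sock_getsockopt() (simplified; surrounding logic elided):

    int max_optlen;

    /* read optlen unconditionally, instead of only when the
     * CGROUP_GETSOCKOPT static branch happens to be enabled
     */
    if (copy_from_sockptr(&max_optlen, optlen, sizeof(max_optlen)))
            return -EFAULT;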
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Co-developed-by: Yanghui Li <[email protected]> Signed-off-by: Yanghui Li <[email protected]> Co-developed-by: Cheng-Jui Wang <[email protected]> Signed-off-by: Cheng-Jui Wang <[email protected]> Signed-off-by: Tze-nan Wu <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4 |
|
| #
32e18e76 |
| 09-Feb-2024 |
Oliver Crumrine <[email protected]> |
bpf: remove check in __cgroup_bpf_run_filter_skb
Originally, this patch removed a redundant check in BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in the function it called, __cgroup_bpf_run_filter_skb. For v2, it was recommended that I remove the check from __cgroup_bpf_run_filter_skb and add the checks to the other macro that calls that function, BPF_CGROUP_RUN_PROG_INET_INGRESS.
To sum it up, checking that the socket exists and that it is a full socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the function they call, __cgroup_bpf_run_filter_skb.
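A condensed sketch of the resulting macro shape (the real macros carry more arguments and the egress CN handling):

    #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)                     \
    ({                                                                    \
            int __ret = 0;                                                \
            if (cgroup_bpf_enabled(CGROUP_INET_INGRESS) &&                \
                sk && sk_fullsock(sk))                                    \
                    __ret = __cgroup_bpf_run_filter_skb(sk, skb,          \
                                                CGROUP_INET_INGRESS);     \
            __ret;                                                        \
    })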
v3->v4: Fixed weird merge conflict.
v2->v3: Sent to bpf-next instead of as a generic patch.
v1->v2: Addressed feedback about where the check should be removed.
Signed-off-by: Oliver Crumrine <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3 Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Revision tags: v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7 |
|
| #
0b05b0cd |
| 16-Oct-2023 |
Breno Leitao <[email protected]> |
net/socket: Break down __sys_getsockopt
Split __sys_getsockopt() into two functions by moving the core logic into a sub-function (do_sock_getsockopt()). This avoids code duplication when other callers need the same operation.
do_sock_getsockopt() will be called by io_uring getsockopt() command operation in the following patch.
The same was done for the setsockopt pair.
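The resulting shape of the syscall entry point, roughly (sketch; error handling and the exact argument types are approximated):

    int __sys_getsockopt(int fd, int level, int optname,
                         char __user *optval, int __user *optlen)
    {
            struct socket *sock;
            int err, fput_needed;

            sock = sockfd_lookup_light(fd, &err, &fput_needed);
            if (!sock)
                    return err;

            /* the former body of __sys_getsockopt() lives here now */
            err = do_sock_getsockopt(sock, false, level, optname,
                                     optval, optlen);

            fput_light(sock->file, fput_needed);
            return err;
    }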
Suggested-by: Martin KaFai Lau <[email protected]> Signed-off-by: Breno Leitao <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
3f31e0d1 |
| 16-Oct-2023 |
Breno Leitao <[email protected]> |
bpf: Add sockptr support for setsockopt
The whole network stack uses sockptr, and while it doesn't move to something more modern, let's use sockptr in the setsockopt BPF hooks too, so they can be used by other callers.
The main motivation for this change is to use it in the io_uring {g,s}etsockopt(), which will use a userspace pointer for *optval but a kernel value for optlen.
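The point of sockptr_t is that both call sites funnel into the same copy helpers; a small sketch:

    /* syscall path: the option value lives in userspace */
    sockptr_t optval = USER_SOCKPTR(user_buf);
    /* the io_uring path would instead use KERNEL_SOCKPTR(kernel_buf) */

    /* consumers no longer care where the buffer lives */
    if (copy_from_sockptr(&val, optval, sizeof(val)))
            return -EFAULT;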
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Breno Leitao <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
a615f67e |
| 16-Oct-2023 |
Breno Leitao <[email protected]> |
bpf: Add sockptr support for getsockopt
The whole network stack uses sockptr, and while it doesn't move to something more modern, let's use sockptr in the getsockopt BPF hooks too, so they can be used by other callers.
The main motivation for this change is to use it in the io_uring {g,s}etsockopt(), which will use a userspace pointer for *optval but a kernel value for optlen.
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Breno Leitao <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.6-rc6 |
|
| #
859051dd |
| 11-Oct-2023 |
Daan De Meyer <[email protected]> |
bpf: Implement cgroup sockaddr hooks for unix sockets
These hooks allow intercepting connect(), getsockname(), getpeername(), sendmsg() and recvmsg() for unix sockets. The unix socket hooks get write access to the address length because the address length is not fixed when dealing with unix sockets and needs to be modified when a unix socket address is modified by the hook. Because abstract unix socket addresses start with a NUL byte, we cannot recalculate the socket address in kernelspace after running the hook by calculating the length of the unix socket path using strlen().
These hooks can be used when users want to multiplex syscalls to a single unix socket across multiple different processes behind the scenes, by redirecting the connect() and other syscalls to process-specific sockets.
We do not implement support for intercepting bind() because when using bind() with unix sockets with a pathname address, this creates an inode in the filesystem which must be cleaned up. If we rewrite the address, the user might try to clean up the wrong file, leaking the socket in the filesystem where it is never cleaned up. Until we figure out a solution for this (and a use case for intercepting bind()), we opt to not allow rewriting the sockaddr in bind() calls.
We also implement recvmsg() support for connected streams so that after a connect() that is modified by a sockaddr hook, any corresponding recvmsg() on the connected socket can also be modified to make the connected program think it is connected to the "intended" remote.
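From the BPF side, such a hook might look like the following sketch (the section name follows the accompanying selftests; the actual rewrite logic is elided):

    SEC("cgroup/connect_unix")
    int redirect_unix_connect(struct bpf_sock_addr *ctx)
    {
            /* inspect or rewrite the unix sockaddr here; when the
             * path changes, the hook also updates the address length
             */
            return 1;       /* allow the (possibly rewritten) connect() */
    }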
Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Daan De Meyer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
|
| #
fefba7d1 |
| 11-Oct-2023 |
Daan De Meyer <[email protected]> |
bpf: Propagate modified uaddrlen from cgroup sockaddr programs
As prep for adding unix socket support to the cgroup sockaddr hooks, let's propagate the sockaddr length back to the caller after running a bpf cgroup sockaddr hook program. While not important for AF_INET or AF_INET6, the sockaddr length is important when working with AF_UNIX sockaddrs as the size of the sockaddr cannot be determined just from the address family or the sockaddr's contents.
__cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as an input/output argument. After running the program, the modified sockaddr length is stored in the uaddrlen pointer.
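Roughly, the signature change looks like this (sketch; unrelated parameters elided):

    /* before */
    int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
                                          struct sockaddr *uaddr, ...);

    /* after: uaddrlen carries the address length in, and the
     * (possibly modified) length back out
     */
    int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
                                          struct sockaddr *uaddr,
                                          int *uaddrlen, ...);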
Signed-off-by: Daan De Meyer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Revision tags: v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4 |
|
| #
223f5f79 |
| 24-Jun-2023 |
Kui-Feng Lee <[email protected]> |
bpf, net: Check skb ownership against full socket.
Check the ownership of an skb against full sockets instead of request_sock.
The filters were called only if an skb is owned by the sock that the skb is sent out through. In other words, skb->sk should point to the sock that is sending the skb through its egress. However, the filters would miss SYN/ACK skbs that are owned by a request_sock but sent through the listener sock, that is, the socket listening for incoming connections.
However, the listener socket is also the full socket of the request socket. We should use the full socket as the owner socket of an skb instead.
What is the ownership check for?
================================
BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the ownership of an skb. Alexei referred to a mailing list conversation [0] that took place a few years ago. In that conversation, Daniel Borkmann stated that:
Wouldn't that mean however, when you go through stacked devices that you'd run the same eBPF cgroup program for skb->sk multiple times?
According to what Daniel said, the ownership check mentioned earlier presumably prevents multiple calls of egress filters caused by an skb.
A test that reproduces this scenario shows that the BPF cgroup egress programs can be called multiple times for one skb if this ownership check is not there. So, we cannot just remove this check.
Test Stacked Devices
====================
We use L2TP to build an environment of stacked devices. L2TP (Layer 2 Tunneling Protocol) is a tunneling protocol used to support virtual private networks (VPNs). It relays encapsulated packets, for example in UDP, to its peer by using a socket.
Using L2TP, packets are first sent through the IP stack and should then arrive at an L2TP device. The device will expand its skb header to encapsulate the packet. The skb will be sent back to the IP stack using the socket that was made for the L2TP session. After that, the routing process will occur once more, but this time for a new destination.
We changed tools/testing/selftests/net/l2tp.sh to set up a test environment using L2TP. The run_ping() function in l2tp.sh is where the main change occurred.
    run_ping()
    {
        local desc="$1"

        sleep 10
        run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1 fc00:101::2
        log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
        sleep 10
    }
The test will use L2TP devices to send PING messages. These messages will have a message size of 227 bytes as a special label to distinguish them. This is not an ideal solution, but it works.
During the execution of the test script, bpftrace was attached to ip6_finish_output() and l2tp_xmit_skb():
    bpftrace -e '
    kfunc:ip6_finish_output {
        time("%H:%M:%S: ");
        printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p skb->sk=%p\n",
               args->skb, args->skb->len, args->sk->sk_cgrp_data.cgroup,
               args->sk, args->skb->sk);
    }
    kfunc:l2tp_xmit_skb {
        time("%H:%M:%S: ");
        printf("l2tp_xmit_skb skb=%p sk=%p\n",
               args->skb, args->session->tunnel->sock);
    }'
The following is part of the output messages printed by bpftrace:
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275 cgroup=0xffff88810741f800 sk=0xffff888105f3b900 skb->sk=0xffff888105f3b900
16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337 cgroup=0xffff88810741f800 sk=0xffff888103dd6300 skb->sk=0xffff888105f3b900
16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337 cgroup=(nil) sk=(nil) skb->sk=(nil)
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275 cgroup=0xffffffff837741d0 sk=0xffff888101fe0000 skb->sk=0xffff888101fe0000
16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337 cgroup=0xffff88810741f800 sk=0xffff888103483180 skb->sk=0xffff888101fe0000
16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337 cgroup=(nil) sk=(nil) skb->sk=(nil)
The first four entries describe a PING message sent using the ping command, whereas the following four entries describe the response received. It can be observed that multiple sockets are used to send one skb, including the socket used by the L2TP session.
Based on this information, it seems that the ownership check is designed to avoid multiple calls of egress filters caused by a single skb.
[0] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Kui-Feng Lee <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3 |
|
| #
dea6a4e1 |
| 23-Aug-2022 |
Stanislav Fomichev <[email protected]> |
bpf: Introduce cgroup_{common,current}_func_proto
Split cgroup_base_func_proto into the following:
* cgroup_common_func_proto - common helpers for all cgroup hooks
* cgroup_current_func_proto - common helpers for all cgroup hooks running in process context (== have a meaningful 'current')
Move bpf_{g,s}et_retval and other cgroup-related helpers into kernel/bpf/cgroup.c so they are closer to where they are used.
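A condensed sketch of the common flavor (the real function covers more helpers):

    const struct bpf_func_proto *
    cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
    {
            switch (func_id) {
            case BPF_FUNC_get_local_storage:
                    return &bpf_get_local_storage_proto;
            case BPF_FUNC_get_retval:
                    return &bpf_get_retval_proto;
            case BPF_FUNC_set_retval:
                    return &bpf_set_retval_proto;
            default:
                    return NULL;
            }
    }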
Signed-off-by: Stanislav Fomichev <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5 |
|
| #
69fd337a |
| 28-Jun-2022 |
Stanislav Fomichev <[email protected]> |
bpf: per-cgroup lsm flavor
Allow attaching to lsm hooks in the cgroup context.
Attaching to per-cgroup LSM works exactly like attaching to other per-cgroup hooks. A new BPF_LSM_CGROUP attach type is added to trigger the new mode; the actual lsm hook we attach to is signaled via the existing attach_btf_id.
For the hooks that have 'struct socket' or 'struct sock' as its first argument, we use the cgroup associated with that socket. For the rest, we use 'current' cgroup (this is all on default hierarchy == v2 only). Note that for some hooks that work on 'struct sock' we still take the cgroup from 'current' because some of them work on the socket that hasn't been properly initialized yet.
Behind the scenes, we allocate a shim program that is attached to the trampoline and runs cgroup effective BPF programs array. This shim has some rudimentary ref counting and can be shared between several programs attaching to the same lsm hook from different cgroups.
Note that this patch bloats cgroup size because we add 211 cgroup_bpf_attach_type(s) for simplicity's sake. This will be addressed in a subsequent patch.
Also note that we only add non-sleepable flavor for now. To enable sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu, shim programs have to be freed via trace rcu, cgroup_bpf.effective should be also trace-rcu-managed + maybe some other changes that I'm not aware of.
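From the program side, attaching to a per-cgroup LSM hook might look like this sketch (hook name illustrative; returning 1 allows the operation and 0 rejects it):

    SEC("lsm_cgroup/socket_bind")
    int BPF_PROG(restrict_bind, struct socket *sock,
                 struct sockaddr *address, int addrlen)
    {
            /* per-cgroup policy decision goes here */
            return 1;       /* 1 == allow, 0 == reject (-EPERM by default) */
    }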
Reviewed-by: Martin KaFai Lau <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
00442143 |
| 28-Jun-2022 |
Stanislav Fomichev <[email protected]> |
bpf: convert cgroup_bpf.progs to hlist
This lets us reclaim some space to be used by new cgroup lsm slots.
Before:

    struct cgroup_bpf {
        struct bpf_prog_array *    effective[23];       /*     0   184 */
        /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
        struct list_head           progs[23];           /*   184   368 */
        /* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
        u32                        flags[23];           /*   552    92 */

        /* XXX 4 bytes hole, try to pack */

        /* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
        struct list_head           storages;            /*   648    16 */
        struct bpf_prog_array *    inactive;            /*   664     8 */
        struct percpu_ref          refcnt;              /*   672    16 */
        struct work_struct         release_work;        /*   688    32 */

        /* size: 720, cachelines: 12, members: 7 */
        /* sum members: 716, holes: 1, sum holes: 4 */
        /* last cacheline: 16 bytes */
    };

After:

    struct cgroup_bpf {
        struct bpf_prog_array *    effective[23];       /*     0   184 */
        /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
        struct hlist_head          progs[23];           /*   184   184 */
        /* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
        u8                         flags[23];           /*   368    23 */

        /* XXX 1 byte hole, try to pack */

        /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
        struct list_head           storages;            /*   392    16 */
        struct bpf_prog_array *    inactive;            /*   408     8 */
        struct percpu_ref          refcnt;              /*   416    16 */
        struct work_struct         release_work;        /*   432    72 */

        /* size: 504, cachelines: 8, members: 7 */
        /* sum members: 503, holes: 1, sum holes: 1 */
        /* last cacheline: 56 bytes */
    };
Suggested-by: Jakub Sitnicki <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Reviewed-by: Martin KaFai Lau <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5 |
|
| #
d9d31cf8 |
| 25-Apr-2022 |
Stanislav Fomichev <[email protected]> |
bpf: Use bpf_prog_run_array_cg_flags everywhere
Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and use it everywhere. check_return_code already enforces sane return ranges for all cgroup types. (Only the egress and bind hooks have uncanonical return ranges; the rest use [0, 1].)
No functional changes.
v2: - 'func_ret & 1' under explicit test (Andrii & Martin)
Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2 |
|
| #
46531a30 |
| 27-Jan-2022 |
Pavel Begunkov <[email protected]> |
cgroup/bpf: fast path skb BPF filtering
Even though there is a static key protecting from overhead from cgroup-bpf skb filtering when there is nothing attached, in many cases it's not enough, as registering a filter for one type will ruin the fast path for all others. This is observed in production servers I've looked at, but also in laptops, where registration is done during init by systemd or something else.
Add a per-socket fast path check guarding from such overhead. This affects both receive and transmit paths of TCP, UDP and other protocols. It showed ~1% tx/s improvement in small-payload UDP send benchmarks using a real NIC in a server environment, and the number jumps to 2-3% for preemptible kernels.
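The idea is a cheap per-socket check that consults the socket's own cgroup before entering the generic run path; a sketch:

    static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
                                               enum cgroup_bpf_attach_type type)
    {
            struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
            struct bpf_prog_array *array;

            array = rcu_access_pointer(cgrp->bpf.effective[type]);
            return array != &bpf_empty_prog_array.hdr;
    }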
Reviewed-by: Stanislav Fomichev <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6 |
|
| #
fd1740b6 |
| 16-Dec-2021 |
Jakub Kicinski <[email protected]> |
bpf: Remove the cgroup -> bpf header dependency
Remove the dependency from cgroup-defs.h to bpf-cgroup.h and bpf.h. This reduces the incremental build size of x86 allmodconfig after bpf.h was touched from ~17k objects rebuilt to ~5k objects. bpf.h is 2.2kLoC and is modified relatively often.
We need a new header with just the definition of struct cgroup_bpf and enum cgroup_bpf_attach_type; this is akin to cgroup-defs.h.
Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15 |
|
| #
588e5d87 |
| 29-Oct-2021 |
He Fengqing <[email protected]> |
cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c
In commit 324bda9e6c5a ("bpf: multi program support for cgroup+bpf"), the cgroup_bpf_*() wrappers were called from kernel/bpf/syscall.c, but now they are only used in kernel/bpf/cgroup.c, so move these functions to kernel/bpf/cgroup.c, like cgroup_bpf_replace().
Signed-off-by: He Fengqing <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
| #
9dfc685e |
| 26-Oct-2021 |
Eric Dumazet <[email protected]> |
inet: remove races in inet{6}_getname()
syzbot reported data-races in inet_getname() multiple times, it is time we fix this instead of pretending applications should not trigger them.
getsockname() and getpeername() are not really considered fast path.
v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration needed when CONFIG_CGROUP_BPF=n, as reported by kernel test robot <[email protected]>
A typical syzbot report:
BUG: KCSAN: data-race in __inet_hash_connect / inet_getname
write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1:
 __inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831
 inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853
 tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275
 __inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664
 inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728
 __sys_connect_file net/socket.c:1896 [inline]
 __sys_connect+0x254/0x290 net/socket.c:1913
 __do_sys_connect net/socket.c:1923 [inline]
 __se_sys_connect net/socket.c:1920 [inline]
 __x64_sys_connect+0x3d/0x50 net/socket.c:1920
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0:
 inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790
 __sys_getsockname+0x11d/0x1b0 net/socket.c:1946
 __do_sys_getsockname net/socket.c:1961 [inline]
 __se_sys_getsockname net/socket.c:1958 [inline]
 __x64_sys_getsockname+0x3e/0x50 net/socket.c:1958
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000 -> 0xdee0
Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7 |
|
| #
6fc88c35 |
| 19-Aug-2021 |
Dave Marchevsky <[email protected]> |
bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum
Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf attach types and a function to map bpf_attach_type values to the new enum. Inspired by netns_bpf_attach_type.
Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever possible. Functionality is unchanged as attach_type_to_prog_type switches in bpf/syscall.c were preventing non-cgroup programs from making use of the invalid cgroup_bpf array slots.
As a result struct cgroup_bpf uses 504 fewer bytes relative to when its arrays were sized using MAX_BPF_ATTACH_TYPE.
bpf_cgroup_storage is notably not migrated as struct bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type member which is not meant to be opaque. Similarly, bpf_cgroup_link continues to report its bpf_attach_type member to userspace via fdinfo and bpf_link_info.
To ease disambiguation, bpf_attach_type variables are renamed from 'type' to 'atype' when changed to cgroup_bpf_attach_type.
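A condensed sketch of the mapping function (one arm per valid cgroup attach type):

    static enum cgroup_bpf_attach_type
    to_cgroup_bpf_attach_type(enum bpf_attach_type attach_type)
    {
            switch (attach_type) {
            case BPF_CGROUP_INET_INGRESS:
                    return CGROUP_INET_INGRESS;
            case BPF_CGROUP_INET_EGRESS:
                    return CGROUP_INET_EGRESS;
            /* ... remaining valid cgroup attach types ... */
            default:
                    return CGROUP_BPF_ATTACH_TYPE_INVALID;
            }
    }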
Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v5.14-rc6 |
|
| #
a2baf4e8 |
| 10-Aug-2021 |
Yonghong Song <[email protected]> |
bpf: Fix potentially incorrect results with bpf_get_local_storage()
Commit b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") fixed a bug for bpf_get_local_storage() helper so different tasks won't mess up with each other's percpu local storage.
The percpu data contains 8 slots so it can hold up to 8 contexts (same or different tasks), for 8 different program runs, at the same time. This in general is sufficient. But our internal testing showed the following warning multiple times:
[...] warning:
WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193 __cgroup_bpf_run_filter_sock_ops+0x13e/0x180
RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
<IRQ>
 tcp_call_bpf.constprop.99+0x93/0xc0
 tcp_conn_request+0x41e/0xa50
 ? tcp_rcv_state_process+0x203/0xe00
 tcp_rcv_state_process+0x203/0xe00
 ? sk_filter_trim_cap+0xbc/0x210
 ? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
 tcp_v6_do_rcv+0x181/0x3e0
 tcp_v6_rcv+0xc65/0xcb0
 ip6_protocol_deliver_rcu+0xbd/0x450
 ip6_input_finish+0x11/0x20
 ip6_input+0xb5/0xc0
 ip6_sublist_rcv_finish+0x37/0x50
 ip6_sublist_rcv+0x1dc/0x270
 ipv6_list_rcv+0x113/0x140
 __netif_receive_skb_list_core+0x1a0/0x210
 netif_receive_skb_list_internal+0x186/0x2a0
 gro_normal_list.part.170+0x19/0x40
 napi_complete_done+0x65/0x150
 mlx5e_napi_poll+0x1ae/0x680
 __napi_poll+0x25/0x120
 net_rx_action+0x11e/0x280
 __do_softirq+0xbb/0x271
 irq_exit_rcu+0x97/0xa0
 common_interrupt+0x7f/0xa0
</IRQ>
 asm_common_interrupt+0x1e/0x40
RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
 ? __cgroup_bpf_run_filter_skb+0x378/0x4e0
 ? do_softirq+0x34/0x70
 ? ip6_finish_output2+0x266/0x590
 ? ip6_finish_output+0x66/0xa0
 ? ip6_output+0x6c/0x130
 ? ip6_xmit+0x279/0x550
 ? ip6_dst_check+0x61/0xd0
[...]
Using drgn [0] to dump the percpu buffer contents showed that on this CPU slot 0 is still available, but slots 1-7 are occupied and those tasks in slots 1-7 mostly don't exist any more. So we might have issues in bpf_cgroup_storage_unset().
Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset(). Currently, it tries to unset the "current" slot by searching from the start, so the following sequence is possible:
1. A task is running and claims slot 0.
2. The running BPF program is done, and it has checked that slot 0 holds the "task" and is ready to reset it to NULL (not yet done).
3. An interrupt happens; another BPF program runs and claims slot 1 with the *same* task.
4. The unset() in interrupt context releases slot 0, since it matches "task".
5. The interrupt is done; the task in process context resets slot 0.
At the end, slot 1 is not reset and the same process can continue to occupy slots 2-7. Finally, when steps 1-5 above are repeated, the step 3 BPF program won't be able to claim an empty slot and a warning will be issued.
To fix the issue, for unset() function, we should traverse from the last slot to the first. This way, the above issue can be avoided.
The same reverse traversal should also be done in bpf_get_local_storage() helper itself. Otherwise, incorrect local storage may be returned to BPF program.
[0] https://github.com/osandov/drgn
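A sketch of the fixed unset() following the description above:

    static inline void bpf_cgroup_storage_unset(void)
    {
            int i;

            /* traverse from the last slot so the most recently
             * claimed slot of this task is released first
             */
            for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
                    if (this_cpu_read(bpf_cgroup_storage_info[i].task) != current)
                            continue;

                    this_cpu_write(bpf_cgroup_storage_info[i].task, NULL);
                    return;
            }
    }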
Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2 |
|
| #
c7603cfa |
| 12-Jul-2021 |
Andrii Nakryiko <[email protected]> |
bpf: Add ambient BPF runtime context stored in current
b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") fixed the problem with cgroup-local storage use in BPF by
bpf: Add ambient BPF runtime context stored in current
b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") fixed the problem with cgroup-local storage use in BPF by pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate possible BPF program preemptions and nested executions.
While this seems to work well in practice, it introduces a new and unnecessary failure mode in which not all BPF programs might be executed if we fail to find an unused slot for cgroup storage, however unlikely that is. It might also not be so unlikely when/if we allow sleepable cgroup BPF programs in the future.
Further, the way that cgroup storage is implemented as ambiently-available property during entire BPF program execution is a convenient way to pass extra information to BPF program and helpers without requiring user code to pass around extra arguments explicitly. So it would be good to have a generic solution that can allow implementing this without arbitrary restrictions. Ideally, such solution would work for both preemptable and sleepable BPF programs in exactly the same way.
This patch introduces such solution, bpf_run_ctx. It adds one pointer field (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of macros in such a way that it always stays valid throughout BPF program execution. BPF program preemption is handled by remembering previous current->bpf_ctx value locally while executing nested BPF program and restoring old value after nested BPF program finishes. This is handled by two helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are supposed to be used before and after BPF program runs, respectively.
Restoring old value of the pointer handles preemption, while bpf_run_ctx pointer being a property of current task_struct naturally solves this problem for sleepable BPF programs by "following" BPF program execution as it is scheduled in and out of CPU. It would even allow CPU migration of BPF programs, even though it's not currently allowed by BPF infra.
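The save/restore pattern around a program run, sketched:

    struct bpf_cg_run_ctx run_ctx;
    struct bpf_run_ctx *old_run_ctx;

    /* remember the interrupted program's context ... */
    old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
    ret = bpf_prog_run(prog, ctx);
    /* ... and restore it once the nested run finishes */
    bpf_reset_run_ctx(old_run_ctx);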
This patch cleans up cgroup local storage handling as a first application. The design itself is generic, though, with bpf_run_ctx being an empty struct that is supposed to be embedded into a specific struct for a given BPF program type (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand this mechanism for other uses within tracing BPF programs.
To verify that this change doesn't revert the fix to the original cgroup storage issue, I ran the same repro as in the original report ([0]) and didn't get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
[0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6 |
|
| #
2daae896 |
| 01-Apr-2021 |
Wan Jiabing <[email protected]> |
bpf, cgroup: Delete repeated struct bpf_prog declaration
struct bpf_prog is declared twice. There is one declaration, independent of the macro, at line 18, so the one below is not needed. Remove the duplicate.
Signed-off-by: Wan Jiabing <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v5.12-rc5 |
|
| #
b910eaaa |
| 23-Mar-2021 |
Yonghong Song <[email protected]> |
bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper
Jiri Olsa reported a bug ([1]) in the kernel where the cgroup local storage pointer may be NULL in the bpf_get_local_storage() helper. There are two issues uncovered by this bug: (1) a kprobe or tracepoint prog incorrectly sets cgroup local storage before the prog run; (2) due to the change from preempt_disable to migrate_disable, preemption is possible and the percpu storage might be overwritten by other tasks.
Issue (1) is fixed in [2]. This patch tries to address issue (2). The following shows how things can go wrong:

    task 1: bpf_cgroup_storage_set() for percpu local storage
            preemption happens
    task 2: bpf_cgroup_storage_set() for percpu local storage
            preemption happens
    task 1: run bpf program
task 1 will effectively use the percpu local storage set by task 2, which will be either NULL or incorrect.
Instead of just one common local storage per cpu, this patch fixed the issue by permitting 8 local storages per cpu and each local storage is identified by a task_struct pointer. This way, we allow at most 8 nested preemption between bpf_cgroup_storage_set() and bpf_cgroup_storage_unset(). The percpu local storage slot is released (calling bpf_cgroup_storage_unset()) by the same task after bpf program finished running. bpf_test_run() is also fixed to use the new bpf_cgroup_storage_set() interface.
The patch is tested on top of [2] with the reproducer in [1]. Without this patch, the kernel emits an error within 2-3 minutes; with this patch, there is still no error after one hour.
[1] https://lore.kernel.org/bpf/CAKH8qBuXCfUz=w8L+Fj74OaUpbosO29niYwTki7e3Ag044_aww@mail.gmail.com/T
[2] https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Roman Gushchin <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6 |
|
| #
77241217 |
| 27-Jan-2021 |
Stanislav Fomichev <[email protected]> |
bpf: Allow rewriting to ports under ip_unprivileged_port_start
At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port to one of the privileged ports (< ip_unprivileged_port_start), but it will be rejected later on in __inet_bind or __inet6_bind.
Let's add another return value to indicate that CAP_NET_BIND_SERVICE check should be ignored. Use the same idea as we currently use in cgroup/egress where bit #1 indicates CN. Instead, for cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should be bypassed.
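From a cgroup/bind{4,6} program's point of view, sketched (bit #0 permits the bind, bit #1 requests the CAP_NET_BIND_SERVICE bypass):

    SEC("cgroup/bind4")
    int bind_prog(struct bpf_sock_addr *ctx)
    {
            ctx->user_port = bpf_htons(80); /* privileged port */
            return 3;       /* allow + bypass CAP_NET_BIND_SERVICE */
    }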
v5: - rename flags to be less confusing (Andrey Ignatov) - rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags and accept BPF_RET_SET_CN (no behavioral changes)
v4: - Add missing IPv6 support (Martin KaFai Lau)
v3: - Update description (Martin KaFai Lau) - Fix capability restore in selftest (Martin KaFai Lau)
v2: - Switch to explicit return code (Martin KaFai Lau)
Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Reviewed-by: Martin KaFai Lau <[email protected]> Acked-by: Andrey Ignatov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|