|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
88044230 |
| 04-Mar-2025 |
Peilin Ye <[email protected]> |
bpf: Introduce load-acquire and store-release instructions
Introduce BPF instructions with load-acquire and store-release semantics, as discussed in [1]. Define 2 new flags:
#define BPF_LOAD_ACQ   0x100
#define BPF_STORE_REL  0x110
A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_LOAD_ACQ (0x100).
Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_STORE_REL (0x110).
Unlike existing atomic read-modify-write operations that only support BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an exception, however, 64-bit load-acquires/store-releases are not supported on 32-bit architectures (to fix a build error reported by the kernel test robot).
An 8- or 16-bit load-acquire zero-extends the value before writing it to a 32-bit register, just like ARM64 instruction LDARH and friends.
Similar to existing atomic read-modify-write operations, misaligned load-acquires/store-releases are not allowed (even if BPF_F_ANY_ALIGNMENT is set).
As an example, consider the following 64-bit load-acquire BPF instruction (assuming little-endian):
db 10 00 00 00 01 00 00 r0 = load_acquire((u64 *)(r1 + 0x0))
opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
imm (0x00000100): BPF_LOAD_ACQ
Similarly, a 16-bit BPF store-release:
cb 21 00 00 10 01 00 00 store_release((u16 *)(r1 + 0x0), w2)
opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
imm (0x00000110): BPF_STORE_REL
In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new instructions, until the corresponding JIT compiler supports them in arena.
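As an illustration, the new encodings can be recognized with a check like the following (a sketch: the helper names are made up for this example; only the flag values and the BPF_STX | BPF_ATOMIC encoding come from the patch):

#include <linux/bpf.h>

#ifndef BPF_LOAD_ACQ
#define BPF_LOAD_ACQ   0x100
#define BPF_STORE_REL  0x110
#endif

/* True for BPF_STX | BPF_ATOMIC insns whose imm selects load-acquire. */
static bool insn_is_load_acquire(const struct bpf_insn *insn)
{
    return BPF_CLASS(insn->code) == BPF_STX &&
           BPF_MODE(insn->code) == BPF_ATOMIC &&
           insn->imm == BPF_LOAD_ACQ;
}

/* True for BPF_STX | BPF_ATOMIC insns whose imm selects store-release. */
static bool insn_is_store_release(const struct bpf_insn *insn)
{
    return BPF_CLASS(insn->code) == BPF_STX &&
           BPF_MODE(insn->code) == BPF_ATOMIC &&
           insn->imm == BPF_STORE_REL;
}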
[1] https://lore.kernel.org/all/[email protected]/
Acked-by: Eduard Zingerman <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Cc: kernel test robot <[email protected]> Signed-off-by: Peilin Ye <[email protected]> Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
e723608b |
| 04-Mar-2025 |
Kumar Kartikeya Dwivedi <[email protected]> |
bpf: Add verifier support for timed may_goto
Implement support in the verifier for replacing the counter-based may_goto implementation with one that samples time on the local CPU, in order to permit a bigger loop bound.
We implement it by maintaining 16 bytes per stack frame: 8 bytes for the count used to amortize time sampling, and 8 bytes for the starting timestamp. To minimize overhead, we need to avoid spilling and filling of registers around this sequence, so we push this cost into the time sampling function 'arch_bpf_timed_may_goto'. This is a JIT-specific wrapper around bpf_check_timed_may_goto which returns the count to store into the stack through BPF_REG_AX. All caller-saved registers (r0-r5) are guaranteed to remain untouched.
The loop can be broken by returning a count of 0; otherwise, we dispatch into the function when the count drops to 0, and the runtime chooses either to refresh it (by returning the count as BPF_MAX_TIMED_LOOPS) or to return 0 and abort the loop on the next iteration.
Since the check for 0 is done right after loading the count from the stack, all subsequent cond_break sequences should immediately break as well, whether in the same loop or in subsequent loops in the program.
We pass in the stack_depth of the count (and thus the timestamp, by adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be passed in to bpf_check_timed_may_goto as an argument after r1 is saved, by adding the offset to r10/fp. This adjustment will be arch specific, and the next patch will introduce support for x86.
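As a rough sketch of the refresh logic described above (illustrative only; the struct layout and the 250 ms bound follow this description, not the literal kernel code):

struct bpf_timed_may_goto {
    u64 count;      /* read/written by the JITed cond_break sequence */
    u64 timestamp;  /* set on the first refill */
};

u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
{
    u64 time = ktime_get_mono_fast_ns();

    if (!p->timestamp) {                /* first time the count hit 0 */
        p->timestamp = time;
        return BPF_MAX_TIMED_LOOPS;
    }
    if (time - p->timestamp >= 250 * NSEC_PER_MSEC)
        return 0;                       /* budget exhausted: break the loop */
    return BPF_MAX_TIMED_LOOPS;         /* refresh the count, keep looping */
}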
Note that depending on loop complexity, time spent in the loop can be more than the current limit (250 ms), but imposing an upper bound on program runtime is an orthogonal problem which will be addressed when program cancellations are supported.
The current time afforded by cond_break may not be enough for cases where BPF programs want to implement locking algorithms inline, and use cond_break as a promise to the verifier that they will eventually terminate.
Below are some benchmarking numbers on the time taken per-iteration for an empty loop that counts the number of iterations until cond_break fires. For comparison, we compare it against bpf_for/bpf_repeat which is another way to achieve the same number of spins (BPF_MAX_LOOPS). The hardware used for benchmarking was a Sapphire Rapids Intel server with performance governor enabled, mitigations were enabled.
+------------------------------+--------------+--------------+------------------+
| Loop type                    | Iterations   | Time (ms)    | Time/iter (ns)   |
+------------------------------+--------------+--------------+------------------+
| may_goto                     | 8388608      | 3            | 0.36             |
| timed_may_goto (count=65535) | 589674932    | 250          | 0.42             |
| bpf_for                      | 8388608      | 10           | 1.19             |
+------------------------------+--------------+--------------+------------------+
This gives a good approximation at low overhead while staying close to the current implementation.
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.14-rc5 |
|
| #
4e4136c6 |
| 25-Feb-2025 |
Amery Hung <[email protected]> |
selftests/bpf: Test gen_pro/epilogue that generate kfuncs
Test gen_prologue and gen_epilogue that generate kfuncs that have not been seen in the main program.
The main bpf program and return value checks are identical to pro_epilogue.c introduced in commit 47e69431b57a ("selftests/bpf: Test gen_prologue and gen_epilogue"). However, now when bpf_testmod_st_ops detects a program name with the prefix "test_kfunc_", it generates a slightly different prologue and epilogue: they still add 1000 to args->a in the prologue, and add 10000 to args->a and set r0 to 2 * args->a in the epilogue, but now involve kfuncs.
At a high level, the alternative version of the prologue and epilogue looks like this:
cgrp = bpf_cgroup_from_id(0);
if (cgrp)
        bpf_cgroup_release(cgrp);
else
        /* Perform what the original bpf_testmod_st_ops prologue or
         * epilogue does
         */
Since 0 is never a valid cgroup id, the original prologue or epilogue logic will be performed. As a result, the __retval check should expect the exact same return value.
Signed-off-by: Amery Hung <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.14-rc4 |
|
| #
fd93eaff |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
The subsequent patch will implement BPF TX timestamping. It will call the sockops BPF program without holding the sock lock.
This breaks the current assumption that all sock_ops programs will hold the sock lock. Access to the sock fields of the uapi's bpf_sock_ops relies on this assumption.
To address this, a new "u8 is_locked_tcp_sock;" field is added. This patch sets it in the current sock_ops callbacks. The "is_fullsock" test is then replaced by the "is_locked_tcp_sock" test during sock_ops_convert_ctx_access().
The new TX timestamping callbacks added in the subsequent patch will not have this set. This will prevent unsafe access from the new timestamping callbacks.
Potentially, we could allow read-only access. However, this would require identifying which callbacks are read-safe and would also require additional BPF instruction rewrites in convert_ctx. Since the BPF program can always read everything from a socket (e.g., by using bpf_core_cast), this patch keeps it simple and disables all read and write access to any socket fields through the bpf_sock_ops UAPI from the new TX timestamping callback.
Moreover, note that some of the fields in bpf_sock_ops are specific to tcp_sock, and sock_ops currently only supports tcp_sock. In the future, UDP timestamping will be added, which will also break this assumption. The same idea used in this patch will be reused. Considering that the current sock_ops only supports tcp_sock, the variable is named is_locked_"tcp"_sock.
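A C-level equivalent of the rewritten access (a sketch only; the real change is a BPF instruction rewrite in sock_ops_convert_ctx_access(), and the struct below is trimmed to the fields of interest):

struct bpf_sock_ops_kern_sketch {
    struct sock *sk;
    u8 is_fullsock;
    u8 is_locked_tcp_sock;  /* new: set only by callbacks holding the sock lock */
};

static u32 read_snd_cwnd(const struct bpf_sock_ops_kern_sketch *skops)
{
    /* The old rewrite tested is_fullsock; the new one tests
     * is_locked_tcp_sock, so unlocked (TX timestamping) callbacks read 0.
     */
    if (!skops->is_locked_tcp_sock)
        return 0;
    return tcp_sk(skops->sk)->snd_cwnd;
}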
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected]
|
|
Revision tags: v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
b238e187 |
| 10-Dec-2024 |
Eduard Zingerman <[email protected]> |
bpf: refactor bpf_helper_changes_pkt_data to use helper number
Use the BPF helper number instead of a function pointer in bpf_helper_changes_pkt_data(). This simplifies usage of this function in verifier.c:check_cfg() (in a follow-up patch), where only the helper number is easily available and there is no real need to look up the helper proto.
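A sketch of the resulting call pattern (the id-based prototype and the mark_pkt_data_changed() placeholder are assumptions for illustration, not code from the patch):

/* For a BPF_CALL to a helper (src_reg == 0), insn->imm already holds
 * the helper id, so no proto lookup is needed:
 */
if (insn->code == (BPF_JMP | BPF_CALL) && insn->src_reg == 0 &&
    bpf_helper_changes_pkt_data(insn->imm))
    mark_pkt_data_changed();    /* placeholder for the caller's action */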
Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.13-rc2 |
|
| #
7cd1107f |
| 03-Dec-2024 |
Alexander Lobakin <[email protected]> |
bpf, xdp: constify some bpf_prog * function arguments
In lots of places, the bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. The same goes for net_device. Address at least some of them and add `const` attributes there. The object code doesn't change, but this may prevent unwanted data modifications and also allows more helpers to have const arguments.
Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v6.13-rc1, v6.12 |
|
| #
a76ab573 |
| 12-Nov-2024 |
Yonghong Song <[email protected]> |
bpf: Find eligible subprogs for private stack support
The private stack will be allocated with the percpu allocator at JIT time. To avoid complexity at runtime, only one copy of the private stack is available per cpu per prog, so a runtime recursion check is necessary to avoid stack corruption.
The current private stack only supports kprobe/perf_event/tp/raw_tp, which have recursion checks in the kernel, and prog types that use the bpf trampoline recursion check. For trampoline-related prog types, currently only tracing progs have recursion checking.
To avoid complexity, all async_cb subprogs use the normal kernel stack, including those subprogs used by both the main prog subtree and the async_cb subtree. Any prog having a tail call also uses the kernel stack.
To avoid a JIT penalty with private stack support, a subprog stack size threshold is set such that the private stack is used only if the stack size is no less than the threshold. The current threshold is 64 bytes. This avoids the JIT penalty when stack usage is small; the rule is summarized in the sketch below.
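The resulting eligibility rule can be summarized by a predicate like this (a sketch; the field names follow the verifier's bpf_subprog_info, but the helper and threshold macro are illustrative):

#define BPF_PRIV_STACK_MIN_SIZE 64  /* threshold mentioned above */

static bool subprog_may_use_priv_stack(const struct bpf_subprog_info *si)
{
    if (si->is_async_cb)        /* async_cb subtree stays on the kernel stack */
        return false;
    if (si->has_tail_call)      /* tail calls stay on the kernel stack */
        return false;
    return si->stack_depth >= BPF_PRIV_STACK_MIN_SIZE;
}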
A useless 'continue' is also removed from a loop in func check_max_stack_depth().
Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6 |
|
| #
169c3176 |
| 29-Aug-2024 |
Martin KaFai Lau <[email protected]> |
bpf: Add gen_epilogue to bpf_verifier_ops
This patch adds a .gen_epilogue to the bpf_verifier_ops. It is similar to the existing .gen_prologue. Instead of allowing a subsystem to run code at the beginning of a bpf prog, it allows the subsystem to run code just before the bpf prog exits.
One of the use cases is to allow the upcoming bpf qdisc to ensure that skb->dev is the same as qdisc->dev_queue->dev. The bpf qdisc struct_ops implementation could either fix it up or drop the skb. Another use case could be in bpf_tcp_ca.c, to enforce that snd_cwnd has a sane value (e.g. non-zero).
The epilogue can do the useful thing (like checking skb->dev) if it can access the bpf prog's ctx. Unlike the prologue, r1 may not hold the ctx pointer, so this patch saves r1 on the stack if .gen_epilogue has returned some instructions in the "epilogue_buf".
The existing .gen_prologue is handled in convert_ctx_accesses(), and the new .gen_epilogue is handled there as well. When a (BPF_JMP | BPF_EXIT) instruction is seen, it is patched with the earlier generated "epilogue_buf". The epilogue patching is only done for the main prog.
Only one epilogue will be patched into the main program. When the bpf prog has multiple BPF_EXIT instructions, a BPF_JA is used to jump to the earlier patched epilogue. The majority of the archs support (BPF_JMP32 | BPF_JA): x86, arm, s390, riscv64, loongarch, powerpc and arc. This patch keeps it simple and always uses (BPF_JMP32 | BPF_JA). A new macro BPF_JMP32_A is added to generate the (BPF_JMP32 | BPF_JA) insn.
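A minimal .gen_epilogue implementation could look like the sketch below (the callback signature shown here is an assumption); it forces r0 to 1 just before the prog exits:

static int example_gen_epilogue(struct bpf_insn *insn_buf, const struct bpf_prog *prog,
                                s16 ctx_stack_off)
{
    struct bpf_insn *insn = insn_buf;

    /* ctx_stack_off: where r1 (ctx) was saved, if the epilogue needs it. */

    /* Force a fixed return value just before the prog exits. */
    *insn++ = BPF_MOV64_IMM(BPF_REG_0, 1);
    *insn++ = BPF_EXIT_INSN();

    return insn - insn_buf;     /* number of patched-in instructions */
}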
Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
10b2a44c |
| 02-Aug-2024 |
Eric Dumazet <[email protected]> |
inet6: constify 'struct net' parameter of various lookup helpers
Following helpers do not touch their struct net argument:
- bpf_sk_lookup_run_v6()
- __inet6_lookup_established()
- inet6_lookup_reuseport()
- inet6_lookup_listener()
- inet6_lookup_run_sk_lookup()
- __inet6_lookup()
- inet6_lookup()
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
|
| #
d4433e8b |
| 02-Aug-2024 |
Eric Dumazet <[email protected]> |
inet: constify 'struct net' parameter of various lookup helpers
Following helpers do not touch their struct net argument:
- bpf_sk_lookup_run_v4()
- inet_lookup_reuseport()
- inet_lhash2_lookup()
- inet_lookup_run_sk_lookup()
- __inet_lookup_listener()
- __inet_lookup_established()
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
d839a731 |
| 28-Jun-2024 |
Sebastian Andrzej Siewior <[email protected]> |
net: Optimize xdp_do_flush() with bpf_net_context infos.
Every NIC driver utilizing XDP should invoke xdp_do_flush() after processing all packets. With the introduction of the bpf_net_context logic, the flush lists (for dev, CPU-map and xsk) are lazily initialized only if used. However, xdp_do_flush() tries to flush all three of them, so all three lists are always initialized and the likely empty lists are "iterated". Without the usage of XDP but with CONFIG_DEBUG_NET, the lists are also initialized due to xdp_do_check_flushed().
Jakub suggested utilizing the hints in bpf_net_context and avoiding the invocation of the flush functions. This also avoids initializing the lists which are otherwise unused.
Introduce bpf_net_ctx_get_all_used_flush_lists() to return the individual lists if not empty. Use this logic in xdp_do_flush() and xdp_do_check_flushed(). Remove the no longer needed .*_check_flush() helpers.
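The shape of the optimized flush path implied above (a sketch; the helper's exact signature, output-parameter order and the per-type flush functions' parameters are assumptions):

void xdp_do_flush(void)
{
    struct list_head *dev_lh, *cpu_lh, *xsk_lh;

    /* Returns NULL for lists that were never initialized/used. */
    bpf_net_ctx_get_all_used_flush_lists(&dev_lh, &cpu_lh, &xsk_lh);

    if (dev_lh)
        __dev_flush(dev_lh);
    if (cpu_lh)
        __cpu_map_flush(cpu_lh);
    if (xsk_lh)
        __xsk_map_flush(xsk_lh);
}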
Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
|
|
Revision tags: v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3 |
|
| #
7e1f4eb9 |
| 04-Apr-2024 |
Arnd Bergmann <[email protected]> |
kallsyms: rework symbol lookup return codes
Building with W=1 in some configurations produces a false positive warning for kallsyms:
kernel/kallsyms.c: In function '__sprint_symbol.isra':
kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as destination [-Werror=restrict]
  503 |         strcpy(buffer, name);
      |         ^~~~~~~~~~~~~~~~~~~~
This originally showed up while building with -O3, but later started happening in other configurations as well, depending on inlining decisions. The underlying issue is that the local 'name' variable is always initialized to be the same as 'buffer' in the called functions that fill the buffer, which gcc notices while inlining, though it could see that the address check always skips the copy.
The calling conventions here are rather unusual, as all of the internal lookup functions (bpf_address_lookup, ftrace_mod_address_lookup, ftrace_func_address_lookup, module_address_lookup and kallsyms_lookup_buildid) already use the provided buffer and either return the address of that buffer to indicate success, or NULL for failure, but the callers are written to also expect an arbitrary other buffer to be returned.
Rework the calling conventions to return the length of the filled buffer instead of its address, which is simpler and easier to follow as well as avoiding the warning. Leave only the kallsyms_lookup() calling conventions unchanged, since that is called from 16 different functions and adapting this would be a much bigger change.
Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/ Tested-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
|
| #
3f9fe37d |
| 20-Jun-2024 |
Sebastian Andrzej Siewior <[email protected]> |
net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT.
The per-CPU flush lists are accessed from within the NAPI callback (xdp_do_flush() for instance). They are subject to the same problem as struct bpf_redirect_info.
Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the access. The lists are initialized on first usage (similar to bpf_net_ctx_get_ri()).
Cc: "Björn Töpel" <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Eduard Zingerman <[email protected]> Cc: Hao Luo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Fastabend <[email protected]> Cc: Jonathan Lemon <[email protected]> Cc: KP Singh <[email protected]> Cc: Maciej Fijalkowski <[email protected]> Cc: Magnus Karlsson <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Song Liu <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Yonghong Song <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
|
| #
401cb7da |
| 20-Jun-2024 |
Sebastian Andrzej Siewior <[email protected]> |
net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the packet and makes decisions. While doing that, the per-CPU variable bpf_redirect_info is used.
- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info and it may also access other per-CPU variables like xskmap_flush_list.
At the very end of the NAPI callback, xdp_do_flush() is invoked which does not access bpf_redirect_info but will touch the individual per-CPU lists.
The per-CPU variables are only used in the NAPI callback, hence disabling bottom halves is the only protection mechanism. Users from preemptible context (like cpu_map_kthread_run()) explicitly disable bottom halves for protection reasons. Without the locking provided by local_bh_disable(), as is the case on PREEMPT_RT, this data structure requires explicit locking.
PREEMPT_RT has forced-threaded interrupts enabled and every NAPI-callback runs in a thread. If each thread has its own data structure then locking can be avoided.
Create a struct bpf_net_context which contains struct bpf_redirect_info. Define the variable on the stack, use bpf_net_ctx_set() to save a pointer to it, and bpf_net_ctx_clear() to remove it again. The bpf_net_ctx_set() may nest. For instance, a function can be used from within NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(), and from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation updates the pointer. Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct bpf_redirect_info. The returned data structure is zero-initialized to ensure nothing is leaked from the stack. This is done on first usage of the struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to note that initialisation is required. The first invocation of bpf_net_ctx_get_ri() will memset() the data structure and update bpf_redirect_info::kern_flags. bpf_redirect_info::nh is excluded from the memset because it is only used once BPF_F_NEIGH is set, which also sets the nh member. The kern_flags member is moved past nh to exclude it from the memset.
The pointer to bpf_net_context is saved in the task's task_struct. Always using the bpf_net_context approach has the advantage that there are almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
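The resulting usage pattern at an XDP entry point looks roughly like this (an illustrative sketch; in the kernel the set/clear calls live in the core networking paths, not in individual drivers):

static int example_rx_entry(struct napi_struct *napi, int budget)
{
    struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
    int done = 0;

    /* On-stack context; if a context is already set, this nests and the
     * outermost caller's pointer stays in place.
     */
    bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);

    /* ... run XDP progs, possibly queueing redirects via bpf_redirect() ... */

    xdp_do_flush();
    bpf_net_ctx_clear(bpf_net_ctx);
    return done;
}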
Cc: Andrii Nakryiko <[email protected]> Cc: Eduard Zingerman <[email protected]> Cc: Hao Luo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Song Liu <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Yonghong Song <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
|
| #
9919c5c9 |
| 15-Jun-2024 |
Rafael Passos <[email protected]> |
bpf: remove unused parameter in bpf_jit_binary_pack_finalize
Fixes a compiler warning: the bpf_jit_binary_pack_finalize function was taking an extra bpf_prog parameter that went unused. This removes it and updates the callers accordingly.
Signed-off-by: Rafael Passos <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
2c1713a8 |
| 18-May-2024 |
Thomas Weißschuh <[email protected]> |
bpf: constify member bpf_sysctl_kern:: Table
The sysctl core is preparing to only expose instances of struct ctl_table as "const". This will also affect the ctl_table argument of sysctl handlers, for which bpf_sysctl_kern::table is also used.
As the function prototype of all sysctl handlers throughout the tree needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch this utility type which is not bound by "typedef proc_handler" to "const struct ctl_table".
No functional change.
Signed-off-by: Thomas Weißschuh <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Joel Granados <[email protected]> Link: https://lore.kernel.org/bpf/20240518-sysctl-const-handler-bpf-v1-1-f0d7186743c1@weissschuh.net
|
| #
2ddec2c8 |
| 02-May-2024 |
Puranjay Mohan <[email protected]> |
riscv, bpf: inline bpf_get_smp_processor_id()
Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
RISCV saves the pointer to the CPU's task_struct in the TP (thread pointer) register. This makes it trivial to get the CPU's processor id. As thread_info is the first member of task_struct, we can read the processor id from TP + offsetof(struct thread_info, cpu).
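The inlined sequence amounts to a single load emitted from the JIT's call handling, roughly like this (a sketch following arch/riscv/net/bpf_jit_comp64.c naming; not the literal patch):

/* In the BPF_JMP | BPF_CALL case of the instruction emitter: */
if (insn->src_reg == 0 && insn->imm == BPF_FUNC_get_smp_processor_id) {
    /* R0 = *(tp + offsetof(struct thread_info, cpu)), i.e. the
     * "ld a5,32(tp)" shown in the JIT output below.
     */
    emit_ld(bpf_to_rv_reg(BPF_REG_0, ctx),
            offsetof(struct thread_info, cpu), RV_REG_TP, ctx);
    break;
}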
RISCV64 JIT output for `call bpf_get_smp_processor_id`
=======================================================

  Before                   After
  --------                 -------

  auipc  t1,0x848c         ld  a5,32(tp)
  jalr   604(t1)           mv  a5,a0
Benchmark using [1] on Qemu.
./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
+---------------+------------------+------------------+--------------+
| Name          | Before           | After            | % change     |
+---------------+------------------+------------------+--------------+
| glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s | + 24.04%     |
| arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s | + 23.56%     |
| hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s | + 32.18%     |
+---------------+------------------+------------------+--------------+
NOTE: This benchmark includes changes from this patch and the previous patch that implemented the per-cpu insn.
[1] https://github.com/anakryiko/linux/commit/8dec900975ef
Signed-off-by: Puranjay Mohan <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Björn Töpel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
66e13b61 |
| 24-Apr-2024 |
Puranjay Mohan <[email protected]> |
bpf: verifier: prevent userspace memory access
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it.
Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region.
The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture or TASK_SIZE is taken as default.
The implementation is as follows:
REG_AX = SRC_REG
if (offset)
        REG_AX += offset;
REG_AX >>= 32;
if (REG_AX <= (uaddress_limit >> 32))
        DST_REG = 0;
else
        DST_REG = *(size *)(SRC_REG + offset);
Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison.
The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace.
Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer:
SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) { *(volatile long *)sk; return 0; }
BPF Program before                    |  BPF Program after
------------------                    |  -----------------

0: (79) r1 = *(u64 *)(r1 +0)             0: (79) r1 = *(u64 *)(r1 +0)
-----------------------------------------------------------------------
1: (79) r1 = *(u64 *)(r1 +0) --\         1: (bf) r11 = r1
----------------------------\  \         2: (77) r11 >>= 32
2: (b7) r0 = 0               \  \        3: (b5) if r11 <= 0x8000 goto pc+2
3: (95) exit                  \  \->     4: (79) r1 = *(u64 *)(r1 +0)
                               \         5: (05) goto pc+1
                                \        6: (b7) r1 = 0
                                 \--------------------------------------
                                         7: (b7) r0 = 0
                                         8: (95) exit
As you can see from above, in the best case (off=0), 5 extra instructions are emitted.
Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it.
x86-64 JIT
==========

JIT's Instrumentation
     (upstream)
---------------------

   0: nopl 0x0(%rax,%rax,1)
   5: xchg %ax,%ax
   7: push %rbp
   8: mov %rsp,%rbp
   b: mov 0x0(%rdi),%rdi
  ---------------------------------
   f: movabs $0x800000000000,%r11
  19: cmp %r11,%rdi
  1c: jb 0x000000000000002a
  1e: mov %rdi,%r11
  21: add $0x0,%r11
  28: jae 0x000000000000002e
  2a: xor %edi,%edi
  2c: jmp 0x0000000000000032
  2e: mov 0x0(%rdi),%rdi
  ---------------------------------
  32: xor %eax,%eax
  34: leave
  35: ret
The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT.
ARM64 JIT
=========

No Instrumentation                Verifier's Instrumentation
   (upstream)                           (This patch)
-----------------                 --------------------------

 0: add x9, x30, #0x0              0: add x9, x30, #0x0
 4: nop                            4: nop
 8: paciasp                        8: paciasp
 c: stp x29, x30, [sp, #-16]!      c: stp x29, x30, [sp, #-16]!
10: mov x29, sp                   10: mov x29, sp
14: stp x19, x20, [sp, #-16]!     14: stp x19, x20, [sp, #-16]!
18: stp x21, x22, [sp, #-16]!     18: stp x21, x22, [sp, #-16]!
1c: stp x25, x26, [sp, #-16]!     1c: stp x25, x26, [sp, #-16]!
20: stp x27, x28, [sp, #-16]!     20: stp x27, x28, [sp, #-16]!
24: mov x25, sp                   24: mov x25, sp
28: mov x26, #0x0                 28: mov x26, #0x0
2c: sub x27, x25, #0x0            2c: sub x27, x25, #0x0
30: sub sp, sp, #0x0              30: sub sp, sp, #0x0
34: ldr x0, [x0]                  34: ldr x0, [x0]
--------------------------------------------------------------------------------
38: ldr x0, [x0] ----------\      38: add x9, x0, #0x0
----------------------------\\    3c: lsr x9, x9, #32
3c: mov x7, #0x0             \\   40: cmp x9, #0x10, lsl #12
40: mov sp, sp                \\  44: b.ls 0x0000000000000050
44: ldp x27, x28, [sp], #16    \\--> 48: ldr x0, [x0]
48: ldp x25, x26, [sp], #16     \ 4c: b 0x0000000000000054
4c: ldp x21, x22, [sp], #16      \ 50: mov x0, #0x0
50: ldp x19, x20, [sp], #16       \---------------------------------------
54: ldp x29, x30, [sp], #16       54: mov x7, #0x0
58: add x0, x7, #0x0              58: mov sp, sp
5c: autiasp                       5c: ldp x27, x28, [sp], #16
60: ret                           60: ldp x25, x26, [sp], #16
64: nop                           64: ldp x21, x22, [sp], #16
68: ldr x10, 0x0000000000000070   68: ldp x19, x20, [sp], #16
6c: br x10                        6c: ldp x29, x30, [sp], #16
                                  70: add x0, x7, #0x0
                                  74: autiasp
                                  78: ret
                                  7c: nop
                                  80: ldr x10, 0x0000000000000088
                                  84: br x10
There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0).
RISC-V JIT (RISCV_ISA_C Disabled)
=================================

No Instrumentation                Verifier's Instrumentation
   (upstream)                           (This patch)
-----------------                 --------------------------

 0: nop                            0: nop
 4: nop                            4: nop
 8: li a6, 33                      8: li a6, 33
 c: addi sp, sp, -16               c: addi sp, sp, -16
10: sd s0, 8(sp)                  10: sd s0, 8(sp)
14: addi s0, sp, 16               14: addi s0, sp, 16
18: ld a0, 0(a0)                  18: ld a0, 0(a0)
---------------------------------------------------------------
1c: ld a0, 0(a0) --\              1c: mv t0, a0
--------------------------\ \     20: srli t0, t0, 32
20: li a5, 0               \ \    24: lui t1, 4096
24: ld s0, 8(sp)            \ \   28: sext.w t1, t1
28: addi sp, sp, 16          \ \  2c: bgeu t1, t0, 12
2c: sext.w a0, a5             \ \--> 30: ld a0, 0(a0)
30: ret                        \  34: j 8
                                \ 38: li a0, 0
                                 \------------------------------
                                  3c: li a5, 0
                                  40: ld s0, 8(sp)
                                  44: addi sp, sp, 16
                                  48: sext.w a0, a5
                                  4c: ret
There are 7 extra instructions added in RISC-V.
Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables") Reported-by: Breno Leitao <[email protected]> Suggested-by: Alexei Starovoitov <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Signed-off-by: Puranjay Mohan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
d503a04f |
| 05-Apr-2024 |
Alexei Starovoitov <[email protected]> |
bpf: Add support for certain atomics in bpf_arena to x86 JIT
Support atomics in bpf_arena that can be JITed as a single x86 instruction. Instructions that are JITed as loops are not supported at the moment, since they require more complex extable and loop logic.
JITs can choose to do smarter things with bpf_jit_supports_insn(). For example, arm64 may decide to support all bpf atomic instructions when emit_lse_atomic is available, and none in ll_sc mode.
bpf_jit_supports_percpu_insn(), bpf_jit_supports_ptr_xchg() and other such callbacks can be replaced with bpf_jit_supports_insn() in the future.
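For example, an arm64-flavoured bpf_jit_supports_insn() along those lines could look like this (a sketch of the idea, not arm64's actual implementation):

bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena)
{
    if (!in_arena)
        return true;

    switch (insn->code) {
    case BPF_STX | BPF_ATOMIC | BPF_W:
    case BPF_STX | BPF_ATOMIC | BPF_DW:
        /* Claim arena atomics only when single-instruction LSE atomics
         * are available; in ll/sc mode report them as unsupported.
         */
        return cpus_have_cap(ARM64_HAS_LSE_ATOMICS);
    }
    return true;
}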
Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
|
| #
7bdbf744 |
| 02-Apr-2024 |
Andrii Nakryiko <[email protected]> |
bpf: add special internal-only MOV instruction to resolve per-CPU addrs
Add a new BPF instruction for resolving absolute addresses of per-CPU data from their per-CPU offsets. This instruction is internal-only and users are not allowed to use it directly. It will only be used for internal inlining optimizations for now, between the BPF verifier and BPF JITs.
We use a special BPF_MOV | BPF_ALU64 | BPF_X form with the insn->off field set to BPF_ADDR_PERCPU = -1. I used a negative offset value to distinguish it from the positive ones used by user-exposed instructions.
Such an instruction performs a resolution of a per-CPU offset stored in a register to a valid kernel address which can be dereferenced. It is useful in any use case where the absolute address of per-CPU data has to be resolved (e.g., in inlining bpf_map_lookup_elem()).
BPF disassembler is also taught to recognize them to support dumping final BPF assembly code (non-JIT'ed version).
Add an arch-specific way for BPF JITs to mark support for this instruction.
This patch also adds support for this instruction in the x86-64 BPF JIT.
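The special form can be matched like this (BPF_ADDR_PERCPU follows the value described above; the helper itself is illustrative):

#define BPF_ADDR_PERCPU (-1)

static bool insn_is_mov_percpu_addr(const struct bpf_insn *insn)
{
    return insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) &&
           insn->off == BPF_ADDR_PERCPU;
}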
Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
ce09cbdd |
| 02-Apr-2024 |
Jose Fernandez <[email protected]> |
bpf: Improve program stats run-time calculation
This patch improves the run-time calculation for program stats by capturing the duration as soon as possible after the program returns.
Previously, the duration included u64_stats_t operations. While the instrumentation overhead is part of the total time spent when stats are enabled, distinguishing between the program's native execution time and the time spent due to instrumentation is crucial for accurate performance analysis.
By making this change, the patch facilitates more precise optimization of BPF programs, enabling users to understand their performance in environments without stats enabled.
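The ordering change amounts to the following pattern (an illustrative sketch, not the literal kernel stats path):

u64 start, duration;
unsigned long flags;
u32 ret;

start = sched_clock();
ret = bpf_prog_run(prog, ctx);
duration = sched_clock() - start;       /* captured before any bookkeeping */

flags = u64_stats_update_begin_irqsave(&stats->syncp);
u64_stats_inc(&stats->cnt);
u64_stats_add(&stats->nsecs, duration); /* u64_stats ops no longer counted */
u64_stats_update_end_irqrestore(&stats->syncp, flags);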
I used a virtualized environment to measure the run-time over one minute for a basic raw_tracepoint/sys_enter program, which just increments a local counter. Although the virtualization introduced some performance degradation that could affect the results, I observed approximately a 16% decrease in average run-time reported by stats with this change (310 -> 260 nsec).
Signed-off-by: Jose Fernandez <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v6.9-rc2, v6.9-rc1 |
|
| #
770546ae |
| 24-Mar-2024 |
Puranjay Mohan <[email protected]> |
bpf: implement insn_is_cast_user() helper for JITs
Implement a helper function to check if an instruction is addr_space_cast from as(0) to as(1). Use this helper in the x86 JIT.
Other JITs can use this helper when they add support for this instruction.
Signed-off-by: Puranjay Mohan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.8 |
|
| #
e60adf51 |
| 08-Mar-2024 |
Christophe Leroy <[email protected]> |
bpf: Take return from set_memory_rox() into account with bpf_jit_binary_lock_ro()
set_memory_rox() can fail, leaving memory unprotected.
Check return and bail out when bpf_jit_binary_lock_ro() returns an error.
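In an arch JIT, a caller would then follow the usual error-out pattern (an illustrative sketch; the label and cleanup details vary per arch):

if (bpf_jit_binary_lock_ro(header)) {
    /* Locking the image read-only+exec failed: drop the JITed image
     * and fall back to the original prog instead of running from
     * writable memory.
     */
    bpf_jit_binary_free(header);
    prog = orig_prog;
    goto out_offsets;
}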
Link: https://github.com/KSPP/linux/issues/7 Signed-off-by: Christophe Leroy <[email protected]> Cc: [email protected] <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Puranjay Mohan <[email protected]> Reviewed-by: Ilya Leoshkevich <[email protected]> # s390x Acked-by: Tiezhu Yang <[email protected]> # LoongArch Reviewed-by: Johan Almbladh <[email protected]> # MIPS Part Message-ID: <036b6393f23a2032ce75a1c92220b2afcb798d5d.1709850515.git.christophe.leroy@csgroup.eu> Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
7d2cc63e |
| 08-Mar-2024 |
Christophe Leroy <[email protected]> |
bpf: Take return from set_memory_ro() into account with bpf_prog_lock_ro()
set_memory_ro() can fail, leaving memory unprotected.
Check its return and take it into account as an error.
Link: https://github.com/KSPP/linux/issues/7 Signed-off-by: Christophe Leroy <[email protected]> Cc: [email protected] <[email protected]> Reviewed-by: Kees Cook <[email protected]> Message-ID: <286def78955e04382b227cb3e4b6ba272a7442e3.1709850515.git.christophe.leroy@csgroup.eu> Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
142fd4d2 |
| 08-Mar-2024 |
Alexei Starovoitov <[email protected]> |
bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction.
LLVM generates bpf_addr_space_cast instruction while translating pointers between native (zero) address space and __attribute__((address_space(N))). The addr_space=1 is reserved as bpf_arena address space.
rY = addr_space_cast(rX, 0, 1) is processed by the verifier and converted to normal 32-bit move: wX = wY
rY = addr_space_cast(rX, 1, 0) has to be converted by JIT:
aux_reg = upper_32_bits of arena->user_vm_start
aux_reg <<= 32
wX = wY                 // clear upper 32 bits of dst register
if (wX)                 // if not zero add upper bits of user_vm_start
        wX |= aux_reg
JIT can do it more efficiently:
mov dst_reg32, src_reg32  // 32-bit move
shl dst_reg, 32
or dst_reg, user_vm_start
rol dst_reg, 32
xor r11, r11
test dst_reg32, dst_reg32 // check if lower 32-bit are zero
cmove r11, dst_reg        // if so, set dst_reg to zero
                          // Intel swapped src/dst register encoding in CMOVcc
Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
show more ...
|