|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
51d65049 |
| 19-Mar-2025 |
Juntong Deng <[email protected]> |
bpf: Add struct_ops context information to struct bpf_prog_aux
This patch adds struct_ops context information to struct bpf_prog_aux.
This context information will be used in the kfunc filter.
Currently the added context information includes struct_ops member offset and a pointer to struct bpf_struct_ops.
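A hedged sketch of what this context might look like inside struct bpf_prog_aux; the field names below follow the description above and are assumptions, not the exact upstream layout:

        /* Sketch only: names and placement are illustrative. */
        struct bpf_prog_aux {
                /* ... existing fields ... */
                u32 attach_st_ops_member_off;          /* offset of the implemented struct_ops member */
                const struct bpf_struct_ops *st_ops;   /* the struct_ops this prog attaches to */
                /* ... */
        };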
Signed-off-by: Juntong Deng <[email protected]> Signed-off-by: Amery Hung <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Link: https://patch.msgid.link/[email protected]
|
|
Revision tags: v6.14-rc7 |
|
| #
0de20461 |
| 16-Mar-2025 |
Kumar Kartikeya Dwivedi <[email protected]> |
bpf: Implement verifier support for rqspinlock
Introduce verifier-side support for rqspinlock kfuncs. The first step is allowing the bpf_res_spin_lock type to be defined in map values and allocated objects, so the BTF side is updated with a new BPF_RES_SPIN_LOCK field to recognize and validate it.
An object cannot have both bpf_spin_lock and bpf_res_spin_lock; only one of them (and at most one per object, as before) may be present. The bpf_res_spin_lock can also be used to protect objects that require lock protection for their kfuncs, like the BPF rbtree and linked list.
The verifier plumbing to simulate success and failure cases when calling the kfuncs is done by pushing a new verifier state to the verifier state stack which will verify the failure case upon calling the kfunc. The path where success is indicated creates all lock reference state and IRQ state (if necessary for irqsave variants). In the case of failure, the state clears the registers r0-r5, sets the return value, and skips kfunc processing, proceeding to the next instruction.
When marking the return value for success case, the value is marked as 0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program, whenever user checks the return value as 'if (ret)' or 'if (ret < 0)' the verifier never traverses such branches for success cases, and would be aware that the lock is not held in such cases.
We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs are invoked. We introduce a kfunc_class state to avoid mixing lock irqrestore kfuncs with IRQ state created by bpf_local_irq_save.
With all this infrastructure, these kfuncs become usable in programs while satisfying all safety properties required by the kernel.
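A hedged BPF C sketch of the pattern described above: a bpf_res_spin_lock placed in a map value, with the 'if (ret)' check that keeps the verifier from assuming the lock is held on the failure branch. The map, type names, and the kfunc prototypes (returning an error code, per the description here) are assumptions for illustration.

        /* assumes vmlinux.h (or equivalent) provides struct bpf_res_spin_lock and struct __sk_buff */
        #include <bpf/bpf_helpers.h>

        extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
        extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

        struct val_type {
                struct bpf_res_spin_lock lock;   /* recognized as a BPF_RES_SPIN_LOCK BTF field */
                long counter;
        };

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, 1);
                __type(key, int);
                __type(value, struct val_type);
        } locked_map SEC(".maps");

        SEC("tc")
        int use_res_lock(struct __sk_buff *ctx)
        {
                int key = 0;
                struct val_type *v = bpf_map_lookup_elem(&locked_map, &key);

                if (!v)
                        return 0;
                if (bpf_res_spin_lock(&v->lock))
                        return 0;        /* failure branch: verifier knows the lock is not held */
                v->counter++;            /* success branch: lock reference state is tracked */
                bpf_res_spin_unlock(&v->lock);
                return 0;
        }

        char _license[] SEC("license") = "GPL";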
Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
97eb35f3 |
| 16-Mar-2025 |
Kumar Kartikeya Dwivedi <[email protected]> |
bpf: Introduce rqspinlock kfuncs
Introduce four new kfuncs, bpf_res_spin_lock, and bpf_res_spin_unlock, and their irqsave/irqrestore variants, which wrap the rqspinlock APIs. bpf_res_spin_lock returns a conditional result, depending on whether the lock was acquired (NULL is returned when lock acquisition succeeds, non-NULL upon failure). The memory pointed to by the returned pointer upon failure can be dereferenced after the NULL check to obtain the error code.
Instead of using the old bpf_spin_lock type, introduce a new type with the same layout, and the same alignment, but a different name to avoid type confusion.
Preemption is disabled upon successful lock acquisition, however IRQs are not. Special kfuncs can be introduced later to allow disabling IRQs when taking a spin lock. Resilient locks are safe against AA deadlocks, hence not disabling IRQs currently does not allow violation of kernel safety.
__irq_flag annotation is used to accept IRQ flags for the IRQ-variants, with the same semantics as existing bpf_local_irq_{save, restore}.
These kfuncs will require additional verifier-side support in subsequent commits, to allow programs to hold multiple locks at the same time.
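A hedged sketch of the IRQ-variant usage, mirroring the bpf_local_irq_{save, restore} semantics described above; the prototypes are assumptions following the same pattern as the non-IRQ variants.

        /* assumes vmlinux.h provides struct bpf_res_spin_lock */
        extern int bpf_res_spin_lock_irqsave(struct bpf_res_spin_lock *lock,
                                             unsigned long *flags__irq_flag) __ksym;
        extern void bpf_res_spin_unlock_irqrestore(struct bpf_res_spin_lock *lock,
                                                   unsigned long *flags__irq_flag) __ksym;

        static __always_inline int bump_under_lock(struct bpf_res_spin_lock *lock, long *counter)
        {
                unsigned long flags;

                if (bpf_res_spin_lock_irqsave(lock, &flags))
                        return -1;       /* failed: lock not held, IRQ state untouched */
                (*counter)++;            /* lock held, IRQs disabled */
                bpf_res_spin_unlock_irqrestore(lock, &flags);
                return 0;
        }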
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
ae0a457f |
| 18-Mar-2025 |
Emil Tsalapatis <[email protected]> |
bpf: Make perf_event_read_output accessible in all program types.
The perf_event_read_event_output helper is currently only available to tracing programs, but is useful for other BPF programs like sched_ext schedulers. When the helper is available, provide its bpf_func_proto directly from the bpf base_proto.
Signed-off-by: Emil Tsalapatis (Meta) <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.14-rc6 |
|
| #
88044230 |
| 04-Mar-2025 |
Peilin Ye <[email protected]> |
bpf: Introduce load-acquire and store-release instructions
Introduce BPF instructions with load-acquire and store-release semantics, as discussed in [1]. Define 2 new flags:
#define BPF_LOAD_ACQ  0x100
#define BPF_STORE_REL 0x110
A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_LOAD_ACQ (0x100).
Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_STORE_REL (0x110).
Unlike existing atomic read-modify-write operations that only support BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an exception, however, 64-bit load-acquires/store-releases are not supported on 32-bit architectures (to fix a build error reported by the kernel test robot).
An 8- or 16-bit load-acquire zero-extends the value before writing it to a 32-bit register, just like ARM64 instruction LDARH and friends.
Similar to existing atomic read-modify-write operations, misaligned load-acquires/store-releases are not allowed (even if BPF_F_ANY_ALIGNMENT is set).
As an example, consider the following 64-bit load-acquire BPF instruction (assuming little-endian):
db 10 00 00 00 01 00 00 r0 = load_acquire((u64 *)(r1 + 0x0))
opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
imm (0x00000100): BPF_LOAD_ACQ
Similarly, a 16-bit BPF store-release:
cb 21 00 00 10 01 00 00 store_release((u16 *)(r1 + 0x0), w2)
opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
imm (0x00000110): BPF_STORE_REL
In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new instructions, until the corresponding JIT compiler supports them in arena.
[1] https://lore.kernel.org/all/[email protected]/
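As a sketch of how the encoding above can be recognized in kernel code (the helper names here are illustrative, not part of this patch):

        #include <linux/filter.h>   /* struct bpf_insn, BPF_CLASS/BPF_MODE and the new flags */

        static bool insn_is_load_acquire(const struct bpf_insn *insn)
        {
                return BPF_CLASS(insn->code) == BPF_STX &&
                       BPF_MODE(insn->code) == BPF_ATOMIC &&
                       insn->imm == BPF_LOAD_ACQ;           /* 0x100 */
        }

        static bool insn_is_store_release(const struct bpf_insn *insn)
        {
                return BPF_CLASS(insn->code) == BPF_STX &&
                       BPF_MODE(insn->code) == BPF_ATOMIC &&
                       insn->imm == BPF_STORE_REL;          /* 0x110 */
        }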
Acked-by: Eduard Zingerman <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Cc: kernel test robot <[email protected]> Signed-off-by: Peilin Ye <[email protected]> Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
e723608b |
| 04-Mar-2025 |
Kumar Kartikeya Dwivedi <[email protected]> |
bpf: Add verifier support for timed may_goto
Implement support in the verifier for replacing the counter-based may_goto implementation with one that samples time on the local CPU, allowing a bigger loop bound.
We implement it by maintaining 16 bytes per stack frame: 8 bytes for the count that amortizes time sampling, and 8 bytes for the starting timestamp. To minimize overhead, we need to avoid spilling and filling of registers around this sequence, so we push this cost into the time sampling function 'arch_bpf_timed_may_goto'. This is a JIT-specific wrapper around bpf_check_timed_may_goto which returns the count to store into the stack through BPF_REG_AX. All caller-saved registers (r0-r5) are guaranteed to remain untouched.
The loop can be broken by returning count as 0, otherwise we dispatch into the function when the count drops to 0, and the runtime chooses to refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0 and aborting the loop on next iteration.
Since the check for 0 is done right after loading the count from the stack, all subsequent cond_break sequences should immediately break as well, whether in the same loop or in subsequent loops in the program.
We pass in the stack_depth of the count (and thus the timestamp, by adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be passed in to bpf_check_timed_may_goto as an argument after r1 is saved, by adding the offset to r10/fp. This adjustment will be arch specific, and the next patch will introduce support for x86.
Note that depending on loop complexity, time spent in the loop can be more than the current limit (250 ms), but imposing an upper bound on program runtime is an orthogonal problem which will be addressed when program cancellations are supported.
The current time afforded by cond_break may not be enough for cases where BPF programs want to implement locking algorithms inline, and use cond_break as a promise to the verifier that they will eventually terminate.
Below are some benchmarking numbers on the time taken per iteration for an empty loop that counts the number of iterations until cond_break fires. For comparison, we compare it against bpf_for/bpf_repeat, which is another way to achieve the same number of spins (BPF_MAX_LOOPS). The hardware used for benchmarking was a Sapphire Rapids Intel server with the performance governor and mitigations enabled.
+-----------------------------+--------------+--------------+------------------+
| Loop type                   | Iterations   | Time (ms)    | Time/iter (ns)   |
+-----------------------------+--------------+--------------+------------------+
| may_goto                    | 8388608      | 3            | 0.36             |
| timed_may_goto (count=65535)| 589674932    | 250          | 0.42             |
| bpf_for                     | 8388608      | 10           | 1.19             |
+-----------------------------+--------------+--------------+------------------+
This gives a good approximation at low overhead while staying close to the current implementation.
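A hedged C sketch of the sampling step described above (the real helper is bpf_check_timed_may_goto; the struct layout and constants below follow the commit text but are assumptions):

        #include <linux/ktime.h>

        #define BPF_MAX_TIMED_LOOPS 0xffff   /* per the count=65535 figure above; assumption */

        struct bpf_timed_may_goto {
                u64 count;        /* amortizes time sampling; 0 breaks the loop */
                u64 timestamp;    /* start of the ~250ms budget */
        };

        static u64 check_timed_may_goto(struct bpf_timed_may_goto *p)
        {
                u64 now = ktime_get_mono_fast_ns();

                if (!p->timestamp) {               /* first expiry: start the clock */
                        p->timestamp = now;
                        return BPF_MAX_TIMED_LOOPS;
                }
                if (now - p->timestamp >= 250 * NSEC_PER_MSEC)
                        return 0;                  /* budget exhausted: abort the loop */
                return BPF_MAX_TIMED_LOOPS;        /* otherwise refresh the count */
        }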
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.14-rc5 |
|
| #
e2d8f560 |
| 01-Mar-2025 |
Kumar Kartikeya Dwivedi <[email protected]> |
bpf: Summarize sleepable global subprogs
The verifier currently does not permit global subprog calls when a lock is held, preemption is disabled, or when IRQs are disabled. This is because we don't know whether the global subprog calls sleepable functions or not.
In case of locks, there's an additional reason: functions called by the global subprog may hold additional locks etc. The verifier won't know while verifying the global subprog whether it was called in context where a spin lock is already held by the program.
Perform summarization of the sleepable nature of a global subprog just like changes_pkt_data, and then allow calls to non-sleepable global subprogs from atomic context.
While making this change, I noticed that RCU read sections had no protection against sleepable global subprog calls; include them in the checks and fix this while we're at it.
Care needs to be taken to not allow global subprog calls when a regular bpf_spin_lock is held. When a resilient spin lock is held, we want to potentially have this check relaxed, but not for now.
Also make sure extensions freplacing global functions cannot do so in case the target is non-sleepable, but the extension is. The other combination is ok.
Tests are included in the next patch to handle all special conditions.
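A hedged BPF C sketch of what this change permits: calling a global subprog that is summarized as non-sleepable while IRQs are disabled (kfunc prototypes assumed to match bpf_local_irq_{save, restore}; assumes vmlinux.h and bpf/bpf_helpers.h in a GPL-licensed object):

        extern void bpf_local_irq_save(unsigned long *flags__irq_flag) __ksym;
        extern void bpf_local_irq_restore(unsigned long *flags__irq_flag) __ksym;

        __noinline int count_one(int x)         /* global subprog with no sleepable calls */
        {
                return x + 1;
        }

        SEC("tc")
        int caller(struct __sk_buff *ctx)
        {
                unsigned long flags;
                int v;

                bpf_local_irq_save(&flags);
                v = count_one(1);               /* permitted once the subprog is summarized as non-sleepable */
                bpf_local_irq_restore(&flags);
                return v;
        }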
Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.14-rc4 |
|
| #
c9eb8102 |
| 22-Feb-2025 |
Alexei Starovoitov <[email protected]> |
bpf: Use try_alloc_pages() to allocate pages for bpf needs.
Use try_alloc_pages() and free_pages_nolock() for BPF needs when context doesn't allow using normal alloc_pages. This is a prerequisite for further work.
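A minimal hedged sketch of the substitution (argument order reflects my understanding of the new API; the real call sites are in the BPF allocator paths):

        #include <linux/gfp.h>

        /* any-context allocation: may fail where alloc_pages() would have slept */
        static struct page *grab_page_any_context(void)
        {
                return try_alloc_pages(NUMA_NO_NODE, 0);
        }

        static void drop_page_any_context(struct page *page)
        {
                if (page)
                        free_pages_nolock(page, 0);
        }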
Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
a687df20 |
| 17-Feb-2025 |
Amery Hung <[email protected]> |
bpf: Support getting referenced kptr from struct_ops argument
Allow struct_ops programs to acquire referenced kptrs from arguments by directly reading the argument.
The verifier will acquire a reference for a struct_ops argument tagged with "__ref" in the stub function at the beginning of the main program. The user will be able to access the referenced kptr directly by reading the context as long as it has not been released by the program.
This new mechanism to acquire referenced kptr (compared to the existing "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons. In the first use case, Qdisc_ops, an skb is passed to .enqueue in the first argument. This mechanism provides a natural way for users to get a referenced kptr in the .enqueue struct_ops programs and makes sure that a qdisc will always enqueue or drop the skb.
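A hedged illustration of the convention described above: the subsystem tags the argument in its stub function, and the struct_ops program can then treat that argument as a referenced kptr. The ops, names, and the exact tagging syntax (a "__ref" parameter-name suffix, mirroring the existing "__nullable" suffix) are assumptions here.

        /* kernel-side stub whose BTF the verifier inspects; the body never runs */
        static int bpf_dummy_qdisc_enqueue(struct sk_buff *skb__ref, void *sch)
        {
                return 0;
        }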
Signed-off-by: Amery Hung <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
43205180 |
| 17-Feb-2025 |
Amery Hung <[email protected]> |
bpf: Make every prog keep a copy of ctx_arg_info
Currently, ctx_arg_info is read-only in the view of the verifier since it is shared among programs of the same attach type. Make each program have its own copy of ctx_arg_info so that we can use it to store program-specific information.
In the next patch where we support acquiring a referenced kptr through a struct_ops argument tagged with "__ref", ctx_arg_info->ref_obj_id will be used to store the unique reference object id of the argument. This avoids creating a requirement in the verifier that "__ref" tagged arguments must be the first set of references acquired [0].
[0] https://lore.kernel.org/bpf/[email protected]/
Signed-off-by: Amery Hung <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
c00d738e |
| 13-Dec-2024 |
Kumar Kartikeya Dwivedi <[email protected]> |
bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
This patch reverts commit cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The patch was well-intended and meant as a stop-gap fixing branch prediction when the pointer may actually be NULL at runtime. Eventually, it was supposed to be replaced by an automated script or compiler pass detecting possibly NULL arguments and marking them accordingly.
However, it caused two main issues observed for production programs and failed to preserve backwards compatibility. First, programs relied on the verifier not exploring == NULL branch when pointer is not NULL, thus they started failing with a 'dereference of scalar' error. Next, allowing raw_tp arguments to be modified surfaced the warning in the verifier that warns against reg->off when PTR_MAYBE_NULL is set.
More information, context, and discussion on both problems are available in [0]. Overall, this approach had several shortcomings, the fixes would further complicate the verifier's logic, and the entire masking scheme would have to be removed eventually anyway.
Hence, revert the patch in preparation of a better fix avoiding these issues to replace this commit.
[0]: https://lore.kernel.org/bpf/[email protected]
Reported-by: Manu Bretelle <[email protected]> Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
4e885fab |
| 13-Dec-2024 |
Anton Protopopov <[email protected]> |
bpf: Add a __btf_get_by_fd helper
Add a new helper to get a pointer to a struct btf from a file descriptor. This helper doesn't increase a refcnt. Add a comment explaining this and pointing to a corresponding function which does take a reference.
Signed-off-by: Anton Protopopov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
| #
7d0d6736 |
| 10-Dec-2024 |
Jann Horn <[email protected]> |
bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
Currently, the pointer stored in call->prog_array is loaded in __uprobe_perf_func(), with no RCU annotation and no immediately visible RCU protection, so it looks as if the loaded pointer can immediately be dangling. Later, bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical section, but this is too late. It then uses rcu_dereference_check(), but this use of rcu_dereference_check() does not actually dereference anything.
Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller provide rcu_read_lock_trace() protection and then load call->prog_array with rcu_dereference_check().
This issue seems to be theoretical: I don't know of any way to reach this code without having handle_swbp() further up the stack, which is already holding a rcu_read_lock_trace() lock, so where we take rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe() doesn't actually have any effect.
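A hedged sketch of the fixed calling pattern described above (surrounding code in __uprobe_perf_func is elided; the exact argument list is an assumption):

        rcu_read_lock_trace();
        prog_array = rcu_dereference_check(call->prog_array, rcu_read_lock_trace_held());
        ret = bpf_prog_run_array_uprobe(prog_array, regs, bpf_prog_run);
        rcu_read_unlock_trace();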
Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps") Suggested-by: Andrii Nakryiko <[email protected]> Signed-off-by: Jann Horn <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
| #
81f6d053 |
| 10-Dec-2024 |
Eduard Zingerman <[email protected]> |
bpf: check changes_pkt_data property for extension programs
When processing calls to global sub-programs, verifier decides whether to invalidate all packet pointers in current state depending on the changes_pkt_data property of the global sub-program.
Because of this, an extension program replacing a global sub-program must be compatible with changes_pkt_data property of the sub-program being replaced.
This commit:
- adds changes_pkt_data flag to struct bpf_prog_aux:
  - this flag is set in check_cfg() for main sub-program;
  - in jit_subprogs() for other sub-programs;
- modifies bpf_check_attach_btf_id() to check changes_pkt_data flag;
- moves call to check_attach_btf_id() after the call to check_cfg(), because it needs changes_pkt_data flag to be set:

      bpf_check:                         bpf_check:
          ...                                ...
        - check_attach_btf_id              resolve_pseudo_ldimm64
          resolve_pseudo_ldimm64    -->       bpf_prog_is_offloaded
            bpf_prog_is_offloaded          check_cfg
          check_cfg                      + check_attach_btf_id
          ...                                ...

The following fields are set by check_attach_btf_id():
- env->ops
- prog->aux->attach_btf_trace
- prog->aux->attach_func_name
- prog->aux->attach_func_proto
- prog->aux->dst_trampoline
- prog->aux->mod
- prog->aux->saved_dst_attach_type
- prog->aux->saved_dst_prog_type
- prog->expected_attach_type
Neither of these fields are used by resolve_pseudo_ldimm64() or bpf_prog_offload_verifier_prep() (for netronome and netdevsim drivers), so the reordering is safe.
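A hedged sketch of the compatibility rule, roughly as it could appear in bpf_check_attach_btf_id() (variable names are illustrative):

        /* an freplace prog that may change packet data must not replace a
         * (sub)program whose callers assume packet pointers stay valid */
        if (prog->type == BPF_PROG_TYPE_EXT &&
            prog->aux->changes_pkt_data && !tgt_changes_pkt_data)
                return -EINVAL;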
Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.13-rc2 |
|
| #
7cd1107f |
| 03-Dec-2024 |
Alexander Lobakin <[email protected]> |
bpf, xdp: constify some bpf_prog * function arguments
In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments.
Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Revision tags: v6.13-rc1, v6.12 |
|
| #
7c8ce4ff |
| 12-Nov-2024 |
Xu Kuohai <[email protected]> |
bpf: Add kernel symbol for struct_ops trampoline
Without kernel symbols for struct_ops trampoline, the unwinder may produce unexpected stacktraces.
For example, the x86 ORC and FP unwinders check if an IP is in kernel text by verifying the presence of the IP's kernel symbol. When a struct_ops trampoline address is encountered, the unwinder stops due to the absence of symbol, resulting in an incomplete stacktrace that consists only of direct and indirect child functions called from the trampoline.
The arm64 unwinder is another example. While the arm64 unwinder can proceed across a struct_ops trampoline address, the corresponding symbol name is displayed as "unknown", which is confusing.
Thus, add kernel symbol for struct_ops trampoline. The name is bpf__<struct_ops_name>_<member_name>, where <struct_ops_name> is the type name of the struct_ops, and <member_name> is the name of the member that the trampoline is linked to.
Below is a comparison of stacktraces captured on x86 by perf record, before and after this patch.
Before:
  ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms])
  ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
  ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms])

After:
  ffffffff811656bd __lock_acquire+0x30d ([kernel.kallsyms])
  ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
  ffffffff81309024 __bpf_prog_enter+0x34 ([kernel.kallsyms])
  ffffffffc000d7e9 bpf__tcp_congestion_ops_cong_avoid+0x3e ([kernel.kallsyms])
  ffffffff81f250a5 tcp_ack+0x10d5 ([kernel.kallsyms])
  ffffffff81f27c66 tcp_rcv_established+0x3b6 ([kernel.kallsyms])
  ffffffff81f3ad03 tcp_v4_do_rcv+0x193 ([kernel.kallsyms])
  ffffffff81d65a18 __release_sock+0xd8 ([kernel.kallsyms])
  ffffffff81d65af4 release_sock+0x34 ([kernel.kallsyms])
  ffffffff81f15c4b tcp_sendmsg+0x3b ([kernel.kallsyms])
  ffffffff81f663d7 inet_sendmsg+0x47 ([kernel.kallsyms])
  ffffffff81d5ab40 sock_write_iter+0x160 ([kernel.kallsyms])
  ffffffff8149c67b vfs_write+0x3fb ([kernel.kallsyms])
  ffffffff8149caf6 ksys_write+0xc6 ([kernel.kallsyms])
  ffffffff8149cb5d __x64_sys_write+0x1d ([kernel.kallsyms])
  ffffffff81009200 x64_sys_call+0x1d30 ([kernel.kallsyms])
  ffffffff82232d28 do_syscall_64+0x68 ([kernel.kallsyms])
  ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Xu Kuohai <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
5bd36da1 |
| 12-Nov-2024 |
Yonghong Song <[email protected]> |
bpf: Support private stack for struct_ops progs
For struct_ops progs, whether a particular prog uses private stack depends on the prog->aux->priv_stack_requested setting before actual insn-level verification for that prog. One particular implementation is to piggyback on struct_ops->check_member(); the next patch has an example of this. The struct_ops->check_member() sets prog->aux->priv_stack_requested to true, which enables private stack usage.
The struct_ops prog follows the same rule as kprobe/tracing progs after the function bpf_enable_priv_stack(). For example, even if a struct_ops prog requests a private stack, it could still use the normal kernel stack if the stack size is small (< 64 bytes).
Similar to tracing progs, nested same cpu same prog run will be skipped. A field (recursion_detected()) is added to bpf_prog_aux structure. If bpf_prog->aux->recursion_detected is implemented by the struct_ops subsystem and nested same cpu/prog happens, the function will be triggered to report an error, collect related info, etc.
Acked-by: Tejun Heo <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
7d1cd70d |
| 12-Nov-2024 |
Yonghong Song <[email protected]> |
bpf, x86: Support private stack in jit
Private stack is allocated in function bpf_int_jit_compile() with alignment 8. Private stack allocation size includes the stack size determined by the verifier and additional space to protect stack overflow and underflow. See the illustration below:
  ---> memory address increasing
  [8 bytes to protect overflow] [normal stack] [8 bytes to protect underflow]
If overflow/underflow is detected, kernel messages will be emitted in dmesg like:
  BPF private stack overflow/underflow detected for prog Fx
  BPF Private stack overflow/underflow detected for prog bpf_prog_a41699c234a1567a_subprog1x
Those messages were generated when I made some changes to the jitted code to intentionally cause overflow for some progs.
For the jited prog, the x86 register 9 (X86_REG_R9) is used to replace the bpf frame register (BPF_REG_10). The private stack is used per subprog per cpu. X86_REG_R9 is saved and restored around every func call (not including tailcall) to maintain its correctness.
Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
e00931c0 |
| 12-Nov-2024 |
Yonghong Song <[email protected]> |
bpf: Enable private stack for eligible subprogs
If private stack is used by any subprog, set that subprog's prog->aux->jits_use_priv_stack to true so the jit can later allocate a private stack for that subprog properly.
Also set env->prog->aux->jits_use_priv_stack to true if any subprog uses private stack. This covers the case of a single main prog (no subprogs) using private stack, and also the case of later struct_ops progs where env->prog->aux->jits_use_priv_stack will enable the recursion check if any subprog uses private stack.
Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
213a6952 |
| 11-Nov-2024 |
Menglong Dong <[email protected]> |
bpf: Replace the document for PTR_TO_BTF_ID_OR_NULL
Commit c25b2ae13603 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL") moved the fields around and misplaced the documentation for "PTR_TO_BTF_ID_OR_NULL". So, let's replace it in the proper place.
Signed-off-by: Menglong Dong <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
|
|
Revision tags: v6.12-rc7 |
|
| #
cb4158ce |
| 04-Nov-2024 |
Kumar Kartikeya Dwivedi <[email protected]> |
bpf: Mark raw_tp arguments with PTR_MAYBE_NULL
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0].
Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing explicit NULL checks to be deleted, and accesses to such pointers potentially crashing the kernel.
To fix this, mark raw_tp arguments as PTR_MAYBE_NULL, and then special case the dereference and pointer arithmetic to permit it, and allow passing them into helpers/kfuncs; these exceptions are made for raw_tp programs only. Ensure that we don't do this when ref_obj_id > 0, as in that case this is an acquired object and doesn't need such adjustment.
The reason we add the mask_raw_tp_trusted_reg logic is that other places will recheck whether the register is a trusted_reg, and would then consider our register as untrusted when detecting the presence of the PTR_MAYBE_NULL flag.
To allow safe dereference, we enable PROBE_MEM marking when we see loads into trusted pointers with PTR_MAYBE_NULL.
While trusted raw_tp arguments can also be passed into helpers or kfuncs where such broken assumption may cause issues, a future patch set will tackle their case separately, as PTR_TO_BTF_ID (without PTR_TRUSTED) can already be passed into helpers and causes similar problems. Thus, they are left alone for now.
It is possible that these checks also permit passing non-raw_tp args that are trusted PTR_TO_BTF_ID with null marking. In such a case, allowing dereference when pointer is NULL expands allowed behavior, so won't regress existing programs, and the case of passing these into helpers is the same as above and will be dealt with later.
Also update the failure case in the tp_btf_nullable selftest to capture the new behavior, as the verifier will no longer cause an error when directly dereferencing a raw tracepoint argument marked as __nullable.
[0]: https://lore.kernel.org/bpf/[email protected]
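A hedged BPF C sketch of what the change enables: explicitly NULL-checking a raw tracepoint argument before use (the tracepoint and argument below are illustrative; assumes vmlinux.h, bpf/bpf_helpers.h, and bpf/bpf_tracing.h):

        SEC("tp_btf/example_tp")
        int BPF_PROG(handle_tp, struct sk_buff *skb)
        {
                if (!skb)          /* the check is kept now that the arg is PTR_MAYBE_NULL */
                        return 0;
                return skb->len;   /* direct read of the trusted PTR_TO_BTF_ID arg */
        }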
Reviewed-by: Jiri Olsa <[email protected]> Reported-by: Juri Lelli <[email protected]> Tested-by: Juri Lelli <[email protected]> Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Revision tags: v6.12-rc6 |
|
| #
61c6fefa |
| 01-Nov-2024 |
Andrii Nakryiko <[email protected]> |
bpf: decouple BPF link/attach hook and BPF program sleepable semantics
BPF link's lifecycle protection scheme depends on both the BPF hook and the BPF program. If *either* of those requires an RCU Tasks Trace GP, then we need to go through a chain of GPs before putting the BPF program refcount and deallocating the BPF link memory.
This patch adds bpf_link-specific sleepable flag, which can be set to true even if underlying BPF program is not sleepable itself. If either link->sleepable or link->prog->sleepable is true, we'll go through a chain of RCU Tasks Trace GP and RCU GP before putting BPF program and freeing memory.
This will be used to protect BPF link for sleepable (faultable) raw tracepoints in the next patch.
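A hedged sketch of the rule described above (the helper name is invented for illustration):

        /* defer through an RCU Tasks Trace GP + RCU GP if either side needs it */
        static bool bpf_link_defer_needs_tasks_trace_gp(const struct bpf_link *link)
        {
                return link->sleepable || (link->prog && link->prog->sleepable);
        }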
Link: https://lore.kernel.org/[email protected] Tested-by: Jordan Rife <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Revision tags: v6.12-rc5 |
|
| #
ba512b00 |
| 23-Oct-2024 |
Martin KaFai Lau <[email protected]> |
bpf: Add uptr support in the map_value of the task local storage.
This patch adds uptr support in the map_value of the task local storage.
struct map_value {
	struct user_data __uptr *uptr;
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct value_type);
} datamap SEC(".maps");
A new bpf_obj_pin_uptrs() is added to pin the user page and also stores the kernel address back to the uptr for the bpf prog to use later. It currently does not support the uptr pointing to a user struct across two pages. It also excludes PageHighMem support to keep it simple. As of now, the 32bit bpf jit is missing other more crucial bpf features. For example, many important bpf features depend on bpf kfunc now but so far only one arch (x86-32) supports it which was added by me as an example when kfunc was first introduced to bpf.
The uptr can only be stored to the task local storage by the syscall update_elem. Meaning the uptr will not be considered if it is provided by the bpf prog through bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE). This is enforced by only calling bpf_local_storage_update(swap_uptrs==true) in bpf_pid_task_storage_update_elem. Everywhere else will have swap_uptrs==false.
This will pump down to bpf_selem_alloc(swap_uptrs==true). It is the only case that bpf_selem_alloc() will take the uptr value when updating the newly allocated selem. bpf_obj_swap_uptrs() is added to swap the uptr between the SDATA(selem)->data and the user provided map_value in "void *value". bpf_obj_swap_uptrs() makes the SDATA(selem)->data takes the ownership of the uptr and the user space provided map_value will have NULL in the uptr.
The bpf_obj_unpin_uptrs() is called after map->ops->map_update_elem() returning error. If the map->ops->map_update_elem has reached a state that the local storage has taken the uptr ownership, the bpf_obj_unpin_uptrs() will be a no op because the uptr is NULL. A "__"bpf_obj_unpin_uptrs is added to make this error path unpin easier such that it does not have to check the map->record is NULL or not.
BPF_F_LOCK is not supported when the map_value has uptr. This can be revisited later if there is a use case. A similar swap_uptrs idea can be considered.
The final bit is to do unpin_user_page in the bpf_obj_free_fields(). The earlier patch has ensured that the bpf_obj_free_fields() has gone through the rcu gp when needed.
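A hedged user-space sketch of installing the uptr via the update_elem syscall, per the restriction above, reusing struct map_value and struct user_data from the example earlier in this message; the pidfd-as-key handling is an assumption here.

        #include <sys/syscall.h>
        #include <unistd.h>
        #include <bpf/bpf.h>

        static int install_uptr(int map_fd, struct user_data *udata)
        {
                struct map_value val = { .uptr = udata };
                int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);

                /* pins the page backing *udata; the kernel address is stored back in the uptr */
                return bpf_map_update_elem(map_fd, &pidfd, &val, BPF_ANY);
        }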
Cc: [email protected] Cc: Shakeel Butt <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Shakeel Butt <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
1cb80d9e |
| 23-Oct-2024 |
Kui-Feng Lee <[email protected]> |
bpf: Support __uptr type tag in BTF
This patch introduces the "__uptr" type tag to BTF. It is used to define a pointer pointing to user space memory. This patch adds the BTF logic to parse the "__uptr" type tag.
btf_find_kptr() is reused for the "__uptr" tag. The "__uptr" will only be supported in the map_value of the task storage map. However, btf_parse_struct_meta() also uses btf_find_kptr() but it is not interested in "__uptr". This patch adds a "field_mask" argument to btf_find_kptr() which will return BTF_FIELD_IGNORE if the caller is not interested in a "__uptr" field.
btf_parse_kptr() is also reused to parse the uptr. The btf_check_and_fixup_fields() is changed to do extra checks on the uptr to ensure that its struct size is not larger than PAGE_SIZE. It is not clear how a uptr pointing to a CO-RE supported kernel struct will be used, so it is also not allowed now.
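A hedged sketch of how the tag is spelled in BPF C, following the existing __kptr convention (the exact macro and header placement are assumptions):

        #define __uptr __attribute__((btf_type_tag("uptr")))

        struct user_data {
                int val;
        };

        struct value_type {
                struct user_data __uptr *udata;   /* pointer into user space memory */
        };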
Signed-off-by: Kui-Feng Lee <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|
| #
6fad274f |
| 21-Oct-2024 |
Daniel Borkmann <[email protected]> |
bpf: Add MEM_WRITE attribute
Add a MEM_WRITE attribute for BPF helper functions which can be used in bpf_func_proto to annotate an argument type in order to let the verifier know that the helper writes into the memory passed as an argument. In the past MEM_UNINIT has been (ab)used for this purpose, but the latter merely tells the verifier that the passed memory can be uninitialized.
There have been bugs with overloading the latter but aside from that there are also cases where the passed memory is read + written which currently cannot be expressed, see also 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error").
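A hedged sketch of the annotation in a bpf_func_proto; the helper itself is invented for illustration.

        #include <linux/bpf.h>
        #include <linux/filter.h>
        #include <linux/string.h>

        BPF_CALL_2(bpf_example_fill, void *, dst, u32, size)
        {
                memset(dst, 0, size);   /* the helper writes into dst */
                return 0;
        }

        static const struct bpf_func_proto bpf_example_fill_proto = {
                .func           = bpf_example_fill,
                .gpl_only       = false,
                .ret_type       = RET_INTEGER,
                .arg1_type      = ARG_PTR_TO_MEM | MEM_WRITE,   /* written by the helper, not just read */
                .arg2_type      = ARG_CONST_SIZE,
        };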
Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
|