Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3

# 35b98180 | 13-Feb-2025 | Hengqi Chen <[email protected]>
tracing: Remove orphaned event_trace_printk
The event_trace_printk macro has no callers since commit b8e65554d80b ("tracing: remove deprecated TRACE_FORMAT"). So drop it.
Link: https://lore.kernel.org/[email protected] Signed-off-by: Hengqi Chen <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5

# 1bd13edb | 27-Dec-2024 | Masami Hiramatsu (Google) <[email protected]>
tracing/hist: Add poll(POLLIN) support on hist file
Add poll syscall support on the `hist` file. The waiter will be woken up with POLLIN when the histogram is updated.
Currently, there is no way to wait for a specific event in userspace, so the user needs to peek at the `trace` file periodically or wait on `trace_pipe`. But it is not a good idea to peek at the `trace` file for an event that happens randomly, and `trace_pipe` does not return until a page is filled with events.
This allows a user to wait for a specific event on the `hist` file. The user can set a histogram trigger on the event they want to monitor and poll() on its `hist` file. Since this poll() returns POLLIN, the next poll() will return soon unless a read() happens on that hist file.
NOTE: To read the hist file again, you must set the file offset back to 0; but if you are just monitoring the event, you may not need to read the histogram at all.
Cc: Shuah Khan <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
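
Below is a minimal userspace sketch of the interface described above (the tracefs path and the sched_switch event are illustrative assumptions; any event with a hist trigger works):

    /* Sketch: block until the histogram updates, then re-read it. */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Assumes a hist trigger was already set on this event. */
        int fd = open("/sys/kernel/tracing/events/sched/sched_switch/hist",
                      O_RDONLY);
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        char buf[4096];
        ssize_t n;

        if (fd < 0)
            return 1;

        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            /* Per the note above: reset the offset before re-reading. */
            lseek(fd, 0, SEEK_SET);
            while ((n = read(fd, buf, sizeof(buf))) > 0)
                fwrite(buf, 1, n, stdout);
        }
        close(fd);
        return 0;
    }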

Revision tags: v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1

# 452f4b31 | 25-Nov-2024 | Christian Göttsche <[email protected]>
tracing: Constify string literal data member in struct trace_event_call
The name member of the struct trace_event_call is assigned generated string literals; declare it as a pointer to read-only characters.
Reported by clang:
    security/landlock/syscalls.c:179:1: warning: initializing 'char *' with an expression of type 'const char[34]' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
      179 | SYSCALL_DEFINE3(landlock_create_ruleset,
      180 |                 const struct landlock_ruleset_attr __user *const, attr,
      181 |                 const size_t, size, const __u32, flags)
    ./include/linux/syscalls.h:226:36: note: expanded from macro 'SYSCALL_DEFINE3'
      226 | #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
    ./include/linux/syscalls.h:234:2: note: expanded from macro 'SYSCALL_DEFINEx'
      234 |         SYSCALL_METADATA(sname, x, __VA_ARGS__)                 \
    ./include/linux/syscalls.h:184:2: note: expanded from macro 'SYSCALL_METADATA'
      184 |         SYSCALL_TRACE_ENTER_EVENT(sname);                       \
    ./include/linux/syscalls.h:151:30: note: expanded from macro 'SYSCALL_TRACE_ENTER_EVENT'
      151 |                 .name = "sys_enter"#sname,                      \
Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Mickaël Salaün <[email protected]> Cc: Günther Noack <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Bill Wendling <[email protected]> Cc: Justin Stitt <[email protected]> Link: https://lore.kernel.org/[email protected] Fixes: b77e38aa240c3 ("tracing: add event trace infrastructure") Signed-off-by: Christian Göttsche <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
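
A simplified view of the change being described, with the struct abbreviated to the one affected member (surrounding fields elided; based on the description, the member simply gains a const qualifier):

    struct trace_event_call {
        /* ... */
        const char *name;   /* was: char *name — now points at read-only data */
        /* ... */
    };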

# afd2627f | 17-Dec-2024 | Steven Rostedt <[email protected]>
tracing: Check "%s" dereference via the field and not the TP_printk format
The TP_printk() portion of a trace event is executed at the time an event is read from the trace. This can happen seconds, minutes, hours, days, months, or even years after the event was recorded. If the print format contains a dereference to a string via "%s", and that string was allocated, there's a chance that string could be freed before it is read by the trace file.
To protect against such bugs, there are two functions that verify the event. The first one is test_event_printk(), which is called when the event is created. It reads the TP_printk() format as well as its arguments to make sure nothing may be dereferencing a pointer that was not copied into the ring buffer along with the event. If it is, it will trigger a WARN_ON().
For strings that use "%s", it is not so easy. The string may not reside in the ring buffer but may still be valid. Strings that are static and part of the kernel proper, which will not be freed for the life of the running system, are safe to dereference. But whether a pointer refers to a static string or to something on the heap cannot be determined until the event is triggered.
This brings us to the second function that tests for bad dereferencing of strings, trace_check_vprintf(). It would walk through the printf format looking for "%s", and when it found one, it would validate that the pointer is safe to read. If not, it would produce a WARN_ON() as well and write "[UNSAFE-MEMORY]" into the ring buffer.
The problem with this is how it used va_list to have vsnprintf() handle all the cases that it didn't need to check. Instead of re-implementing vsnprintf(), it would make a copy of the format up to the %s part, and call vsnprintf() with the current va_list ap variable, where the ap would then be ready to point at the string in question.
For architectures that passed va_list by reference this was possible. For architectures that passed it by copy it was not. A test_can_verify() function was used to differentiate between the two, and if it wasn't possible, it would disable it.
Even for architectures where this was feasible, it was a stretch to rely on such a method that is undocumented, and could cause issues later on with new optimizations of the compiler.
Instead, the first function test_event_printk() was updated to look at "%s" as well. If the "%s" argument is a pointer outside the event in the ring buffer, it would find the field type of the event that is the problem and mark the structure with a new flag called "needs_test". The event itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that this event has a field that needs to be verified before the event can be printed using the printf format.
When the event fields are created from the field type structure, the fields would copy the field type's "needs_test" value.
Finally, before being printed, a new function ignore_event() is called which will check if the event has the TEST_STR flag set (if not, it returns false). If the flag is set, it then iterates through the events fields looking for the ones that have the "needs_test" flag set.
Then it uses the offset field from the field structure to find the pointer in the ring buffer event. It runs the tests to make sure that pointer is safe to print and if not, it triggers the WARN_ON() and also adds to the trace output that the event in question has an unsafe memory access.
The ignore_event() makes the trace_check_vprintf() obsolete so it is removed.
Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/
Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Al Viro <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/[email protected] Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <[email protected]>
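
A rough, compilable sketch of the flow described above (the struct layout, the string_is_safe() helper, and the flag plumbing are assumptions for illustration; the real kernel code differs):

    #include <stdbool.h>
    #include <stddef.h>

    struct event_field {
        size_t offset;       /* where the pointer lives inside the event */
        bool   needs_test;   /* set by test_event_printk() at creation */
        struct event_field *next;
    };

    /* Hypothetical stand-in for the kernel's pointer-validity checks. */
    extern bool string_is_safe(const char *s);

    /* Returns true if the event must not be printed via its format. */
    static bool ignore_event_sketch(bool has_test_str_flag,
                                    struct event_field *fields, char *entry)
    {
        if (!has_test_str_flag)      /* TRACE_EVENT_FL_TEST_STR not set */
            return false;

        for (struct event_field *f = fields; f; f = f->next) {
            if (!f->needs_test)
                continue;
            /* Use the field's offset to find the pointer in the event. */
            const char *str = *(const char **)(entry + f->offset);
            if (!string_is_safe(str))
                return true;     /* report the unsafe access instead */
        }
        return false;
    }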

# 0172afef | 22-Nov-2024 | Thomas Gleixner <[email protected]>
tracing: Record task flag NEED_RESCHED_LAZY.
The scheduler added NEED_RESCHED_LAZY scheduling. Record this state as part of trace flags and expose it in the need_resched field.
Record and expose NEED_RESCHED_LAZY.
[bigeasy: Commit description, documentation bits.]
Cc: Peter Zijlstra <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Link: https://lore.kernel.org/[email protected] Reviewed-by: Ankur Arora <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5

# e9f0a363 | 22-Oct-2024 | Sebastian Andrzej Siewior <[email protected]>
tracing: Remove TRACE_FLAG_IRQS_NOSUPPORT
It was possible to enable tracing with no IRQ tracing support. The tracing infrastructure would then record TRACE_FLAG_IRQS_NOSUPPORT as the only tracing flag and show an 'X' in the output.
The last user of this feature was PPC32, which managed to implement it during the PowerPC merge in 2009. Since then it has been unused, and the PPC32 dependency was finally removed in commit 0ea5ee035133a ("tracing: Remove PPC32 wart from config TRACING_SUPPORT"). Since the PowerPC merge, the code behind !CONFIG_TRACE_IRQFLAGS_SUPPORT with TRACING enabled can no longer be selected or used, and the 'X' is neither displayed nor recorded.
Remove the CONFIG_TRACE_IRQFLAGS_SUPPORT checks from the tracing code. Remove TRACE_FLAG_IRQS_NOSUPPORT.
Cc: Peter Zijlstra <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11

# 49e4154f | 11-Sep-2024 | Zheng Yejian <[email protected]>
tracing: Remove TRACE_EVENT_FL_FILTERED logic
After commit dcb0b5575d24 ("tracing: Remove TRACE_EVENT_FL_USE_CALL_FILTER logic"), no one is going to set TRACE_EVENT_FL_FILTERED or change call->filter, so remove the related logic.
Link: https://lore.kernel.org/[email protected] Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1

# 6e2fdcef | 26-Jul-2024 | Steven Rostedt <[email protected]>
tracing: Use refcount for trace_event_file reference counter
Instead of using an atomic counter for the trace_event_file reference counter, use the refcount interface. It has various checks to make sure the reference counting is correct, and will warn if it detects an error (like refcount_inc() on '0').
Cc: Mathieu Desnoyers <[email protected]> Link: https://lore.kernel.org/[email protected] Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
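
For illustration, the general shape of such a conversion (names here are generic sketches, not the actual diff; release_file() is a hypothetical release helper):

    /* Before: a bare atomic counter gives no sanity checking. */
    atomic_t ref;
    atomic_set(&file->ref, 1);
    atomic_inc(&file->ref);                 /* silent on misuse */
    if (atomic_dec_and_test(&file->ref))
        release_file(file);

    /* After: refcount_t warns on errors such as incrementing from zero. */
    refcount_t ref;
    refcount_set(&file->ref, 1);
    refcount_inc(&file->ref);               /* WARNs if the count was 0 */
    if (refcount_dec_and_test(&file->ref))
        release_file(file);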

# 0e8b5397 | 05-Aug-2024 | Menglong Dong <[email protected]>
bpf: kprobe: remove unused declaring of bpf_kprobe_override
After the commit 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction pointer with original one"), "bpf_kprobe_override" is not used anywhere anymore, and we can remove it now.
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction pointer with original one") Signed-off-by: Menglong Dong <[email protected]> Acked-by: Jiri Olsa <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>

# 1cbe8143 | 30-Jul-2024 | Menglong Dong <[email protected]>
bpf: kprobe: Remove unused declaring of bpf_kprobe_override
After the commit 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction pointer with original one"), "bpf_kprobe_override" is not used anywhere anymore, and we can remove it now.
Fixes: 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction pointer with original one") Signed-off-by: Menglong Dong <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]

Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1

# d4dfc570 | 19-Mar-2024 | Andrii Nakryiko <[email protected]>
bpf: pass whole link instead of prog when triggering raw tracepoint
Instead of passing prog as an argument to the bpf_trace_runX() helpers that are called from tracepoint triggering calls, store the BPF link itself (struct bpf_raw_tp_link for raw tracepoints). This will allow passing extra information, like a BPF cookie, into raw tracepoint registration.
Instead of replacing `struct bpf_prog *prog = __data;` with a corresponding `struct bpf_raw_tp_link *link = __data;` assignment in `__bpf_trace_##call`, I just passed `__data` through into the underlying bpf_trace_runX() call. This works well because we implicitly cast `void *`, and it also avoids naming clashes with arguments coming from the tracepoint's "proto" list. We could have run into the same problem with "prog"; we just happened to not have a tracepoint with a "prog" input argument. We are less lucky with "link", as there are tracepoints already using a "link" argument name. So instead of trying to avoid naming conflicts, let's just remove the intermediate local variable. It doesn't hurt readability; it's a bit of a maze of calls and macros either way, and requires careful reading.
Acked-by: Stanislav Fomichev <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Message-ID: <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
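
A simplified before/after of the generated thunk for a one-argument tracepoint, based on the description above (names abbreviated; the real code is macro-generated):

    /* Before: materialize a local prog pointer from __data. */
    static void __bpf_trace_example(void *__data, unsigned long arg)
    {
        struct bpf_prog *prog = __data;

        bpf_trace_run1(prog, arg);
    }

    /* After: pass __data straight through; the implicit void * conversion
     * still works now that the callee takes the bpf_raw_tp_link, and no
     * local name can clash with the tracepoint's "proto" arguments. */
    static void __bpf_trace_example(void *__data, unsigned long arg)
    {
        bpf_trace_run1(__data, arg);
    }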

Revision tags: v6.8, v6.8-rc7, v6.8-rc6

# 70a6ed55 | 22-Feb-2024 | Steven Rostedt (Google) <[email protected]>
tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"
The TRACE_EVENT macros have a special case when a __string() field is NULL: they save "(null)" as the string. This string is also used by __assign_str(). It's better to create a single macro than to open code a literal, where an unfortunate typo would not be caught by the compiler.
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ville Syrjälä <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Chuck Lever <[email protected]> Suggested-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
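
Presumably the macro amounts to a single shared definition along these lines (a sketch of the idea, not the exact patch):

    /* One shared literal: a typo in EVENT_NULL_STR is a compile error,
     * while a typo inside an open-coded "(null)" string is silent. */
    #define EVENT_NULL_STR "(null)"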

# 2aa043a5 | 12-Mar-2024 | Steven Rostedt (Google) <[email protected]>
tracing/ring-buffer: Fix wait_on_pipe() race
When the trace_pipe_raw file is closed, there should be no new readers on the file descriptor. This is mostly handled with the waking and wait_index fields of the iterator. But there's still a slight race.
    CPU 0                           CPU 1
    -----                           -----
                                    wait_index++;
    index = wait_index;
                                    ring_buffer_wake_waiters();
    wait_on_pipe()
        ring_buffer_wait();
The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is that the ring_buffer_wait() needs the logic of:
    prepare_to_wait();
    if (!condition)
        schedule();
Where the missing condition check is the iter->wait_index update.
Have the ring_buffer_wait() take a conditional callback function and a data parameter that can be used within the wait_event_interruptible() of the ring_buffer_wait() function.
In wait_on_pipe(), pass a condition function that will check if the wait_index has been updated; if it has, it will return true to break out of the wait_event_interruptible() loop.
Create a new field "closed" in the trace_iterator and set it in the .flush() callback before calling ring_buffer_wake_waiters(). This will keep any new readers from waiting on a closed file descriptor.
Have the wait_on_pipe() condition callback also check the closed field.
Change the wait_index field of the trace_iterator to atomic_t. There's no reason it needs to be 'long' and making it atomic and using atomic_read_acquire() and atomic_fetch_inc_release() will provide the necessary memory barriers.
Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after one more try to fetch data. That is, if it waited for data and something woke it up, it should try to collect any new data and then exit back to user space.
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: linke li <[email protected]> Cc: Rabin Vincent <[email protected]> Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) <[email protected]>
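
A sketch of the condition-callback shape this describes (the struct and helper names here are assumptions for illustration, not the exact patch):

    /* Data handed to the wait condition; wait_index is sampled up front. */
    struct pipe_wait {
        struct trace_iterator *iter;
        long                   wait_index;  /* value sampled at wait start */
    };

    /* Returns true to break out of wait_event_interruptible(). */
    static bool wait_pipe_cond(void *data)
    {
        struct pipe_wait *pwait = data;
        struct trace_iterator *iter = pwait->iter;

        /* A waker bumped wait_index since we sampled it: stop waiting. */
        if (atomic_read_acquire(&iter->wait_index) != pwait->wait_index)
            return true;

        /* The file was closed: new readers must not keep waiting. */
        return iter->closed;
    }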

Revision tags: v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1

# bb32500f | 31-Oct-2023 | Steven Rostedt (Google) <[email protected]>
tracing: Have trace_event_file have ref counters
The following can crash the kernel:
    # cd /sys/kernel/tracing
    # echo 'p:sched schedule' > kprobe_events
    # exec 5>>events/kprobes/sched/enable
    # > kprobe_events
    # exec 5>&-
The above commands:
    1. Change directory to the tracefs directory
    2. Create a kprobe event (doesn't matter what one)
    3. Open bash file descriptor 5 on the enable file of the kprobe event
    4. Delete the kprobe event (removes the files too)
    5. Close the bash file descriptor 5
The above causes a crash!
    BUG: kernel NULL pointer dereference, address: 0000000000000028
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 6 PID: 877 Comm: bash Not tainted 6.5.0-rc4-test-00008-g2c6b6b1029d4-dirty #186
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
    RIP: 0010:tracing_release_file_tr+0xc/0x50
What happens here is that the kprobe event creates a trace_event_file "file" descriptor that represents the file in tracefs to the event. It maintains state of the event (is it enabled for the given instance?). Opening the "enable" file gets a reference to the event "file" descriptor via the open file descriptor. When the kprobe event is deleted, the file is also deleted from the tracefs system which also frees the event "file" descriptor.
But as the tracefs file is still opened by user space, it will not be totally removed until the final dput() is called on it. But this is not true with the event "file" descriptor that is already freed. If the user does a write to or simply closes the file descriptor it will reference the event "file" descriptor that was just freed, causing a use-after-free bug.
To solve this, add a ref count to the event "file" descriptor as well as a new flag called "FREED". The "file" will not be freed until the last reference is released. But the FREED flag will be set when the event is removed, to prevent any more modifications to that event from happening, even if there's still a reference to the event "file" descriptor.
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: [email protected] Cc: Mark Rutland <[email protected]> Fixes: f5ca233e2e66d ("tracing: Increase trace array ref count on enable and filter files") Reported-by: Beau Belgrave <[email protected]> Tested-by: Beau Belgrave <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

Revision tags: v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5

# 5790b1fb | 04-Oct-2023 | Steven Rostedt (Google) <[email protected]>
eventfs: Remove eventfs_file and just use eventfs_inode
Instead of having a descriptor for every file represented in the eventfs directory, only have the directory itself represented. Change the API to send in a list of entries that represent all the files in the directory (but not other directories). The entry list contains a name and a callback function that will be used to create the files when they are accessed.
    struct eventfs_inode *eventfs_create_events_dir(const char *name,
                                    struct dentry *parent,
                                    const struct eventfs_entry *entries,
                                    int size, void *data);
is used for the top level eventfs directory, and returns an eventfs_inode that will be used by:
    struct eventfs_inode *eventfs_create_dir(const char *name,
                                    struct eventfs_inode *parent,
                                    const struct eventfs_entry *entries,
                                    int size, void *data);
where both of the above take an array of struct eventfs_entry entries for every file that is in the directory.
The entries are defined by:
    typedef int (*eventfs_callback)(const char *name, umode_t *mode,
                                    void **data,
                                    const struct file_operations **fops);

    struct eventfs_entry {
        const char       *name;
        eventfs_callback  callback;
    };
Where the name is the name of the file and the callback gets called when the file is being created. The callback passes in the name (in case the same callback is used for multiple files), a pointer to the mode, data and fops. The data will be pointing to the data that was passed in eventfs_create_dir() or eventfs_create_events_dir() but may be overridden to point to something else, as it will be used to point to the inode->i_private that is created. The information passed back from the callback is used to create the dentry/inode.
If the callback fills the data and the file should be created, it must return a positive number. On zero or negative, the file is ignored.
This logic may also be used as a prototype to convert entire pseudo file systems into just-in-time allocation.
The "show_events_dentry" file has been updated to show the directories, and any files they have.
With just the eventfs_file allocations:
Before after deltas for meminfo (in kB):
    MemFree:          -14360
    MemAvailable:     -14260
    Buffers:              40
    Cached:               24
    Active:               44
    Inactive:             48
    Inactive(anon):       28
    Active(file):         44
    Inactive(file):       20
    Dirty:                -4
    AnonPages:            28
    Mapped:                4
    KReclaimable:        132
    Slab:               1604
    SReclaimable:        132
    SUnreclaim:         1472
    Committed_AS:         12
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
    ext4_inode_cache       27  [* 1184 = 31968 ]
    extent_status         102  [*   40 = 4080 ]
    tracefs_inode_cache   144  [*  656 = 94464 ]
    buffer_head            39  [*  104 = 4056 ]
    shmem_inode_cache      49  [*  800 = 39200 ]
    filp                  -53  [*  256 = -13568 ]
    dentry                251  [*  192 = 48192 ]
    lsm_file_cache        277  [*   32 = 8864 ]
    vm_area_struct        -14  [*  184 = -2576 ]
    trace_event_file     1748  [*   88 = 153824 ]
    kmalloc-1k             35  [* 1024 = 35840 ]
    kmalloc-256            49  [*  256 = 12544 ]
    kmalloc-192           -28  [*  192 = -5376 ]
    kmalloc-128           -30  [*  128 = -3840 ]
    kmalloc-96          10581  [*   96 = 1015776 ]
    kmalloc-64           3056  [*   64 = 195584 ]
    kmalloc-32           1291  [*   32 = 41312 ]
    kmalloc-16           2310  [*   16 = 36960 ]
    kmalloc-8            9216  [*    8 = 73728 ]
Free memory dropped by 14,360 kB
Available memory dropped by 14,260 kB
Total slab additions in size: 1,771,032 bytes
With this change:
Before after deltas for meminfo (in kB):
    MemFree:          -12084
    MemAvailable:     -11976
    Buffers:              32
    Cached:               32
    Active:               72
    Inactive:            168
    Inactive(anon):      176
    Active(file):         72
    Inactive(file):       -8
    Dirty:                24
    AnonPages:           196
    Mapped:                8
    KReclaimable:        148
    Slab:                836
    SReclaimable:        148
    SUnreclaim:          688
    Committed_AS:        324
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
    tracefs_inode_cache   144  [*  656 = 94464 ]
    shmem_inode_cache     -23  [*  800 = -18400 ]
    filp                  -92  [*  256 = -23552 ]
    dentry                179  [*  192 = 34368 ]
    lsm_file_cache         -3  [*   32 = -96 ]
    vm_area_struct        -13  [*  184 = -2392 ]
    trace_event_file     1748  [*   88 = 153824 ]
    kmalloc-1k            -49  [* 1024 = -50176 ]
    kmalloc-256           -27  [*  256 = -6912 ]
    kmalloc-128          1864  [*  128 = 238592 ]
    kmalloc-64           4685  [*   64 = 299840 ]
    kmalloc-32            -72  [*   32 = -2304 ]
    kmalloc-16            256  [*   16 = 4096 ]

    total = 721352
Free memory dropped by 12,084 kB
Available memory dropped by 11,976 kB
Total slab additions in size: 721,352 bytes
That's over 2 MB in savings per instance for free and available memory, and over 1 MB in savings per instance of slab memory.
Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ajay Kaher <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
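
To make the callback contract concrete, a hypothetical directory with a single "enable" file could be wired up as follows (my_enable_fops and the names are invented for illustration):

    static int my_enable_callback(const char *name, umode_t *mode, void **data,
                                  const struct file_operations **fops)
    {
        /* One callback may serve several names; check which file this is. */
        if (strcmp(name, "enable") != 0)
            return 0;                /* zero/negative: do not create it */

        *mode = 0644;                /* permissions of the created file */
        *fops = &my_enable_fops;     /* hypothetical file operations */
        /* *data already points at what eventfs_create_dir() was given and
         * may be redirected; it ends up in inode->i_private. */
        return 1;                    /* positive: create the file */
    }

    static const struct eventfs_entry my_entries[] = {
        { .name = "enable", .callback = my_enable_callback },
    };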

Revision tags: v6.6-rc4, v6.6-rc3

# 3acf8ace | 20-Sep-2023 | Jiri Olsa <[email protected]>
bpf: Add missed value to kprobe perf link info
Add a missed value to the info of kprobes attached through perf link, to hold the stats of missed kprobe handler executions.
The kprobe's missed counter gets incremented when the kprobe handler is not executed because another kprobe is running on the same CPU.
Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]

Revision tags: v6.6-rc2, v6.6-rc1

# fc52a644 | 08-Sep-2023 | Steven Rostedt (Google) <[email protected]>
tracing/synthetic: Fix order of struct trace_dynamic_info
To better handle BIG and LITTLE endian, the offset/len of the dynamic fields of the synthetic events was changed into a structure of:
    struct trace_dynamic_info {
    #ifdef CONFIG_CPU_BIG_ENDIAN
        u16 offset;
        u16 len;
    #else
        u16 len;
        u16 offset;
    #endif
    };
to replace the manual changes of:
    data_offset = offset & 0xffff;
    data_offset |= len << 16;
But if you look closely, the above is:
<len> << 16 | offset
Which in little endian would be in memory:
offset_lo offset_hi len_lo len_hi
and in big endian:
len_hi len_lo offset_hi offset_lo
Which if broken into a structure would be:
    struct trace_dynamic_info {
    #ifdef CONFIG_CPU_BIG_ENDIAN
        u16 len;
        u16 offset;
    #else
        u16 offset;
        u16 len;
    #endif
    };
Which is the opposite of what was defined.
Fix this and just to be safe also add "__packed".
Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: [email protected] Cc: Mark Rutland <[email protected]> Tested-by: Sven Schnelle <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Fixes: ddeea494a16f3 ("tracing/synthetic: Use union instead of casts") Signed-off-by: Steven Rostedt (Google) <[email protected]>

# 6fdac58c | 08-Sep-2023 | Steven Rostedt (Google) <[email protected]>
tracing: Remove unused trace_event_file dir field
Now that the eventfs structure is used to create the events directory via the dynamically allocated eventfs code, the "dir" field of the trace_event_file structure is no longer used. Remove it.
Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ajay Kaher <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

Revision tags: v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1

# 39f7c41c | 07-Jul-2023 | Valentin Schneider <[email protected]>
tracing/filters: Enable filtering a cpumask field by another cpumask
The recently introduced ipi_send_cpumask trace event contains a cpumask field, but it currently cannot be used in filter expressions.
Make event filtering aware of cpumask fields, and allow these to be filtered by a user-provided cpumask.
The user-provided cpumask is to be given in cpulist format and wrapped as: "CPUS{$cpulist}". The use of curly braces instead of parentheses is to prevent predicate_parse() from parsing the contents of CPUS{...} as a full-fledged predicate subexpression.
This enables e.g.:
$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'
Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

# 89ae89f5 | 09-Aug-2023 | Jiri Olsa <[email protected]>
bpf: Add multi uprobe link
Add a new multi uprobe link that allows attaching a bpf program to multiple uprobes.
Uprobes to attach are specified via the new link_create uprobe_multi union:
    struct {
        __aligned_u64   path;
        __aligned_u64   offsets;
        __aligned_u64   ref_ctr_offsets;
        __u32           cnt;
        __u32           flags;
    } uprobe_multi;
Uprobes are defined for a single binary specified in path, with multiple calling sites specified in the offsets array and optional reference counters specified in the ref_ctr_offsets array. All specified arrays have a length of 'cnt'.
For now, 'flags' supports a single bit that marks the uprobe as a return probe.
Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Yafang Shao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
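
A hedged userspace sketch of filling that union through the raw bpf() syscall (the binary path and offsets are illustrative; this assumes uapi headers that already define BPF_TRACE_UPROBE_MULTI, and real users would normally go through libbpf):

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    /* Attach prog_fd to 'cnt' call sites of one binary in a single link. */
    static int uprobe_multi_attach(int prog_fd, uint64_t *offsets, uint32_t cnt)
    {
        union bpf_attr attr = {};

        attr.link_create.prog_fd = prog_fd;
        attr.link_create.attach_type = BPF_TRACE_UPROBE_MULTI;
        attr.link_create.uprobe_multi.path =
            (uint64_t)(uintptr_t)"/usr/bin/bash";     /* illustrative */
        attr.link_create.uprobe_multi.offsets = (uint64_t)(uintptr_t)offsets;
        attr.link_create.uprobe_multi.cnt = cnt;
        attr.link_create.uprobe_multi.flags = 0;      /* bit 0: return probe */

        return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
    }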

# ddeea494 | 16-Aug-2023 | Sven Schnelle <[email protected]>
tracing/synthetic: Use union instead of casts
The current code uses a lot of casts to access the fields member in struct synth_trace_events with different sizes. This makes the code hard to read, and has already introduced an endianness bug. Use a union and a struct instead.
Link: https://lkml.kernel.org/r/[email protected]
Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Fixes: 00cf3d672a9dd ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
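
The replacement presumably looks something like this (a sketch; the member names are assumptions based on the description, and trace_dynamic_info is the offset/len struct from the entry above):

    /* One union replaces the per-size casts into the fields array. */
    union trace_synth_field {
        u8                         as_u8;
        u16                        as_u16;
        u32                        as_u32;
        u64                        as_u64;
        struct trace_dynamic_info  as_dynamic;
    };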

# 27152bce | 28-Jul-2023 | Ajay Kaher <[email protected]>
eventfs: Move tracing/events to eventfs
Up until now, /sys/kernel/tracing/events was no different than any other part of tracefs. The files and directories within the events directory were created when tracefs was mounted, and also created for the instances in /sys/kernel/tracing/instances/<instance>/events. Most of these files and directories will never be referenced, and since there are thousands of them, they waste precious memory resources.
Move the "events" directory to the new eventfs. The eventfs will take the meta data of the events that they represent and store that. When the files in the events directory are referenced, the dentry and inodes to represent them are then created. When the files are no longer referenced, they are freed. This saves the precious memory resources that were wasted on these seldom referenced dentries and inodes.
Running the following:
    ~# cat /proc/meminfo /proc/slabinfo > before.out
    ~# mkdir /sys/kernel/tracing/instances/foo
    ~# cat /proc/meminfo /proc/slabinfo > after.out
to test the changes produces the following deltas:
Before this change:
Before after deltas for meminfo:
    MemFree:          -32260
    MemAvailable:     -21496
    KReclaimable:      21528
    Slab:              22440
    SReclaimable:      21528
    SUnreclaim:          912
    VmallocUsed:          16
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
    tracefs_inode_cache:  14472  [* 1184 = 17134848]
    buffer_head:             24  [*  168 = 4032]
    hmem_inode_cache:        28  [* 1480 = 41440]
    dentry:               14450  [*  312 = 4508400]
    lsm_inode_cache:      14453  [*   32 = 462496]
    vma_lock:                11  [*  152 = 1672]
    vm_area_struct:           2  [*  184 = 368]
    trace_event_file:      1748  [*   88 = 153824]
    kmalloc-256:           1072  [*  256 = 274432]
    kmalloc-64:            2842  [*   64 = 181888]
Total slab additions in size: 22,763,400 bytes
With this change:
Before after deltas for meminfo:
    MemFree:          -12600
    MemAvailable:     -12580
    Cached:               24
    Active:               12
    Inactive:             68
    Inactive(anon):       48
    Active(file):         12
    Inactive(file):       20
    Dirty:                -4
    AnonPages:            68
    KReclaimable:         12
    Slab:               1856
    SReclaimable:         12
    SUnreclaim:         1844
    KernelStack:          16
    PageTables:           36
    VmallocUsed:          16
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
    tracefs_inode_cache:    108  [* 1184 = 127872]
    buffer_head:             24  [*  168 = 4032]
    hmem_inode_cache:        18  [* 1480 = 26640]
    dentry:                 127  [*  312 = 39624]
    lsm_inode_cache:        152  [*   32 = 4864]
    vma_lock:                67  [*  152 = 10184]
    vm_area_struct:         -12  [*  184 = -2208]
    trace_event_file:      1764  [*   96 = 169344]
    kmalloc-96:           14322  [*   96 = 1374912]
    kmalloc-64:            2814  [*   64 = 180096]
    kmalloc-32:            1103  [*   32 = 35296]
    kmalloc-16:            2308  [*   16 = 36928]
    kmalloc-8:            12800  [*    8 = 102400]
Total slab additions in size: 2,109,984 bytes
Which is a savings of 20,653,416 bytes (20 MB) per tracing instance.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ajay Kaher <[email protected]> Co-developed-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Tested-by: Ching-lin Yu <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

# 5125e757 | 09-Jul-2023 | Yafang Shao <[email protected]>
bpf: Clear the probe_addr for uprobe
To avoid returning uninitialized or random values when querying the file descriptor (fd) and accessing probe_addr, it is necessary to clear the variable prior to its use.
Fixes: 41bdc4b40ed6 ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY") Signed-off-by: Yafang Shao <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>

Revision tags: v6.4, v6.4-rc7, v6.4-rc6

# 334e5519 | 06-Jun-2023 | Masami Hiramatsu (Google) <[email protected]>
tracing/probes: Add fprobe events for tracing function entry and exit.
Add fprobe events for tracing function entry and exit instead of kprobe events. With this change, we can continue to trace function entry/exit even if CONFIG_KPROBES_ON_FTRACE is not available. Since CONFIG_KPROBES_ON_FTRACE requires CONFIG_DYNAMIC_FTRACE_WITH_REGS, it is not available if the architecture only supports CONFIG_DYNAMIC_FTRACE_WITH_ARGS, which means kprobe events can not effectively probe function entry/exit on such architectures. This can be solved if dynamic events support fprobe events.
The fprobe event is a new dynamic event which is only for function (symbol) entry and exit. This event accepts non-register fetch arguments so that users can trace function arguments and return values.
The fprobe event syntax is:
    f[:[GRP/][EVENT]] FUNCTION [FETCHARGS]
    f[MAXACTIVE][:[GRP/][EVENT]] FUNCTION%return [FETCHARGS]
E.g.
    # echo 'f vfs_read $arg1' >> dynamic_events
    # echo 'f vfs_read%return $retval' >> dynamic_events
    # cat dynamic_events
    f:fprobes/vfs_read__entry vfs_read arg1=$arg1
    f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
    # echo 1 > events/fprobes/enable
    # head -n 20 trace | tail
    #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
    #              | |         |   |||||     |         |
            sh-142     [005] ...1.   448.386420: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
            sh-142     [005] .....   448.386436: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
            sh-142     [005] ...1.   448.386451: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
            sh-142     [005] .....   448.386458: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
            sh-142     [005] ...1.   448.386469: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
            sh-142     [005] .....   448.386476: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
            sh-142     [005] ...1.   448.602073: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
            sh-142     [005] .....   448.602089: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
Link: https://lore.kernel.org/all/168507469754.913472.6112857614708350210.stgit@mhiramat.roam.corp.google.com/
Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>

Revision tags: v6.4-rc5, v6.4-rc4

# 4b512860 | 24-May-2023 | Steven Rostedt (Google) <[email protected]>
tracing: Rename stacktrace field to common_stacktrace
The histogram and synthetic events can use a pseudo event called "stacktrace" that will create a stacktrace at the time of the event and use it just like it was a normal field. We have other pseudo events such as "common_cpu" and "common_timestamp". To stay consistent with that, convert "stacktrace" to "common_stacktrace". As this was used in older kernels, to keep backward compatibility, this will act just like "common_cpu" did with "cpu". That is, "cpu" will be the same as "common_cpu" unless the event has a "cpu" field. In which case, the event's field is used. The same is true with "stacktrace".
Also update the documentation to reflect this change.
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Mark Rutland <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>