|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
56799bc0 |
| 04-Mar-2025 |
Frederic Weisbecker <[email protected]> |
perf: Fix hang while freeing sigtrap event
Perf can hang while freeing a sigtrap event if a related deferred signal hadn't managed to be sent before the file got closed:
perf_event_overflow()
   task_work_add(perf_pending_task)

fput()
   task_work_add(____fput())

task_work_run()
    ____fput()
        perf_release()
            perf_event_release_kernel()
                _free_event()
                    perf_pending_task_sync()
                        task_work_cancel() -> FAILED
                        rcuwait_wait_event()
Once task_work_run() is running, the list of pending callbacks is removed from the task_struct and from this point on task_work_cancel() can't remove any pending and not yet started work items, hence the task_work_cancel() failure and the hang on rcuwait_wait_event().
Task work could be changed to remove one work at a time, so a work running on the current task can always cancel a pending one, however the wait / wake design is still subject to inverted dependencies when remote targets are involved, as pictured by Oleg:
T1                                     T2
fd = perf_event_open(pid => T2->pid);  fd = perf_event_open(pid => T1->pid);
close(fd)                              close(fd)
<IRQ>                                  <IRQ>
perf_event_overflow()                  perf_event_overflow()
task_work_add(perf_pending_task)       task_work_add(perf_pending_task)
</IRQ>                                 </IRQ>
fput()                                 fput()
task_work_add(____fput())              task_work_add(____fput())

task_work_run()                        task_work_run()
____fput()                             ____fput()
perf_release()                         perf_release()
perf_event_release_kernel()            perf_event_release_kernel()
_free_event()                          _free_event()
perf_pending_task_sync()               perf_pending_task_sync()
rcuwait_wait_event()                   rcuwait_wait_event()
Therefore the only option left is to acquire the event reference count upon queueing the perf task work and release it from the task work, just like it was done before 3a5465418f5f ("perf: Fix event leak upon exec and file release") but without the leaks it fixed.
Some adjustments are necessary to make it work:
* A child event might dereference its parent upon freeing. Care must be taken to release the parent last.
* Some places assume the event doesn't have any reference held and can therefore be freed right away. They must instead put the reference and let reference counting do its job.
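A minimal sketch of the get/put pattern described above, assuming the perf_pending_task() callback and the refcount/pending_task fields of struct perf_event; the real call sites and failure handling in the patch differ:

static void queue_pending_sigtrap(struct perf_event *event)
{
        /* Hold a reference for as long as the task work is queued. */
        atomic_long_inc(&event->refcount);
        if (task_work_add(current, &event->pending_task, TWA_RESUME))
                put_event(event);       /* task is exiting, drop the reference again */
}

static void perf_pending_task(struct callback_head *head)
{
        struct perf_event *event = container_of(head, struct perf_event, pending_task);

        /* ... deliver the deferred SIGTRAP ... */

        put_event(event);               /* may free the event; nobody has to wait for us */
}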
Reported-by: "Yi Lai" <[email protected]> Closes: https://lore.kernel.org/all/Zx9Losv4YcJowaP%2F@ly-workstation/ Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 3a5465418f5f ("perf: Fix event leak upon exec and file release") Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
| #
12e766d1 |
| 17-Mar-2025 |
Peter Zijlstra <[email protected]> |
perf: Fix __percpu annotation
With bcecd5a529c1 ("percpu: repurpose __percpu tag as a named address space qualifier") the normal compilers start caring about the __percpu annotation, as such f67d1ffd841f ("perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes") needs a fixup.
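For context, a minimal sketch of what the now-enforced qualifier expects (names here are illustrative, not from the patch):

struct stats {
        u64     count;
};

static struct stats __percpu *stats;    /* the pointer carries the __percpu tag */

static int stats_init(void)
{
        stats = alloc_percpu(struct stats);
        return stats ? 0 : -ENOMEM;
}

static void stats_bump(void)
{
        /* Accesses go through the per-CPU accessors, not plain dereferences. */
        this_cpu_inc(stats->count);
}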
Fixes: f67d1ffd841f ("perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes") Fixes: bcecd5a529c1 ("percpu: repurpose __percpu tag as a named address space qualifier") Reported-by: Stephen Rothwell <[email protected]> Reported-by: [email protected] Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
|
| #
bd2da08d |
| 14-Mar-2025 |
Kan Liang <[email protected]> |
perf: Clean up pmu specific data
The pmu specific data is saved in task_struct now. Remove it from the event context structure.
Remove swap_task_ctx() as well.
Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
d57e94f5 |
| 14-Mar-2025 |
Kan Liang <[email protected]> |
perf: Supply task information to sched_task()
To save/restore LBR call stack data in system-wide mode, the task_struct information is required.
Extend the parameters of sched_task() to supply task_struct information.
On schedule in, the LBR call stack data for the new task will be restored. On schedule out, the LBR call stack data for the old task will be saved. Only the required task_struct information needs to be passed.
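A hedged sketch of the extended callback shape this describes; the exact prototype in struct pmu may differ slightly:

static void example_sched_task(struct perf_event_pmu_context *pmu_ctx,
                               struct task_struct *task, bool sched_in)
{
        if (sched_in) {
                /* restore the LBR call stack state saved for @task */
        } else {
                /* save the current LBR call stack state into @task's data */
        }
}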
Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
506e64e7 |
| 14-Mar-2025 |
Kan Liang <[email protected]> |
perf: attach/detach PMU specific data
The LBR call stack data has to be saved/restored during context switch to fix the shorter LBR call stacks issue in system-wide mode. Allocate PMU specific data and attach it to the corresponding task_struct during LBR call stack monitoring.
When a LBR call stack event is accounted, the perf_ctx_data for the related tasks will be allocated/attached by attach_perf_ctx_data(). When a LBR call stack event is unaccounted, the perf_ctx_data for related tasks will be detached/freed by detach_perf_ctx_data().
The LBR call stack event could be a per-task event or a system-wide event.
- For a per-task event, perf only allocates the perf_ctx_data for the current task. If the allocation fails, perf will error out.
- For a system-wide event, perf has to allocate the perf_ctx_data for both the existing tasks and the upcoming tasks. The allocation for existing tasks is done in perf_event_alloc(). If any allocation fails, perf will error out. The allocation for new tasks is done in perf_event_fork(). A global reader/writer semaphore, global_ctx_data_rwsem, is added to address the global race.
- The perf_ctx_data is only freed by the last LBR call stack event. The number of per-task events is tracked by a refcount in each task. Since system-wide events impact all tasks, it's not practical to go through the whole task list to update the refcount for each system-wide event. The number of system-wide events is tracked by a global variable, global_ctx_data_ref.
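A hedged sketch of the account-time decision described above; attach_task_ctx_data()/attach_global_ctx_data() and the refcount_t form of global_ctx_data_ref are assumptions made for illustration only:

static refcount_t global_ctx_data_ref;

static int account_ctx_data(struct perf_event *event)
{
        if (event->attach_state & PERF_ATTACH_TASK)
                /* per-task event: allocate for the target task only */
                return attach_task_ctx_data(event->hw.target);

        /*
         * System-wide event: the first one walks the existing tasks,
         * later ones only bump the global counter. New tasks are
         * covered at fork() time.
         */
        if (refcount_inc_not_zero(&global_ctx_data_ref))
                return 0;
        return attach_global_ctx_data();
}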
Suggested-by: "Peter Zijlstra (Intel)" <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
cb436912 |
| 14-Mar-2025 |
Kan Liang <[email protected]> |
perf: Save PMU specific data in task_struct
Some PMU specific data has to be saved/restored during context switch, e.g. LBR call stack data. Currently, the data is saved in the event context structure, but only for per-process events. For system-wide events, because the LBR call stack data is missing after a context switch, LBR call stacks are always shorter in comparison to per-process mode.
For example:

Per-process mode:
$ perf record --call-graph lbr -- taskset -c 0 ./tchain_edit

-   99.90%    99.86%  tchain_edit  tchain_edit  [.] f3
     99.86% _start
            __libc_start_main
            generic_start_main
            main
            f1
          - f2
               f3

System-wide mode:
$ perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit

-   99.88%    99.82%  tchain_edit  tchain_edit  [.] f3
   - 62.02% main
        f1
        f2
        f3
   - 28.83% f1
      - f2
           f3
   -  8.88% generic_start_main
        main
        f1
        f2
        f3
It isn't practical to simply allocate the data for system-wide events in the CPU context structure for all tasks. We have no idea which CPU a task will be scheduled to. The duplicated LBR data would have to be maintained on every CPU context structure, which is a huge waste. Otherwise, the LBR data is still lost if the task is scheduled to another CPU.
Save the pmu specific data in task_struct. The size of the pmu specific data is 788 bytes for the LBR call stack. Usually, the overall number of threads doesn't exceed a few thousand. For 10K threads, keeping the LBR data would consume an additional ~8MB. The additional space will only be allocated during LBR call stack monitoring. It will be released when the monitoring is finished.
Furthermore, moving task_ctx_data from perf_event_context to task_struct reduces complexity and makes things clearer. E.g. perf doesn't need to swap task_ctx_data on the optimized context switch path. This patch set is just the first step. There could be other optimizations/extensions on top of it. E.g. for cgroup profiling, perf just needs to save/store the LBR call stack information for tasks in a specific cgroup. That could reduce the additional space. Also, the LBR call stack could be made available to software events, or even enable debugging use cases, like LBRs on crash later.
Because of the alignment requirement of Intel Arch LBR, a kmem cache is used to allocate the PMU specific data. It's required when a child task allocates the space. Save it in struct perf_ctx_data. The refcount in struct perf_ctx_data is used to track the users of the pmu specific data.
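A hedged sketch of the container this describes; the field set follows the text (payload, owning kmem cache, refcount), while the exact layout in the patch may differ:

struct perf_ctx_data {
        refcount_t              refcount;       /* users of the PMU specific data */
        struct kmem_cache       *ctx_cache;     /* cache honouring Arch LBR alignment */
        void                    *data;          /* the PMU specific (LBR) payload */
};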
Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Alexey Budankov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
c53e14f1 |
| 10-Mar-2025 |
Kan Liang <[email protected]> |
perf: Extend per event callchain limit to branch stack
The commit 97c79a38cd45 ("perf core: Per event callchain limit") introduced a per-event term to allow finer tuning of the depth of callchains to save space.
It should be applied to the branch stack as well. For example, autoFDO collections require the maximum number of LBR entries, while other system-wide LBR users may only be interested in the latest few LBRs. A per-event LBR depth would save space in the perf output buffer.
The patch simply drops the uninteresting branches, but HW still collects the maximum number of branches. There may be a model-specific optimization that can reduce the HW depth for some cases to lower the overhead further, but it isn't included in this patch set because it's not useful for all cases. For example, ARCH LBR can utilize PEBS and XSAVE to collect LBRs, so the depth should have less impact on the collection overhead. The model-specific optimization may be implemented separately later.
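A minimal sketch of the trimming idea, assuming the existing per-event attr.sample_max_stack term (from 97c79a38cd45) is what carries the depth; the helper and its call site are illustrative:

static void trim_branch_stack(struct perf_event *event,
                              struct perf_branch_stack *br)
{
        u16 limit = event->attr.sample_max_stack;       /* per-event depth */

        /* HW recorded its full depth; only emit the newest @limit entries. */
        if (limit && br->nr > limit)
                br->nr = limit;
}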
Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7 |
|
| #
4eabf533 |
| 04-Nov-2024 |
Peter Zijlstra <[email protected]> |
perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
In preparation for being able to unregister a PMU with existing events, it becomes important to detach the lifetime of struct perf_cpu_pmu_context from that of struct pmu.
Notably struct perf_cpu_pmu_context embeds a struct perf_event_pmu_context that can stay referenced until the last event goes.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ravi Bangoria <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
4baeb068 |
| 04-Nov-2024 |
Peter Zijlstra <[email protected]> |
perf/core: Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_count
Because it makes no sense to have two per-cpu allocations per pmu.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ravi Bangoria <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
c70ca298 |
| 04-Nov-2024 |
Peter Zijlstra <[email protected]> |
perf/core: Simplify the perf_event_alloc() error path
The error cleanup sequence in perf_event_alloc() is a subset of the existing _free_event() function (it must of course be).
Split this out into __free_event() and simplify the error path.
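A hedged sketch of the resulting shape; init_event_parts() is a hypothetical stand-in for the allocation steps, the point being that the error path reuses the shared teardown helper instead of a hand-maintained unwind sequence:

static struct perf_event *event_alloc_sketch(void)
{
        struct perf_event *event;
        int err;

        event = kzalloc(sizeof(*event), GFP_KERNEL);
        if (!event)
                return ERR_PTR(-ENOMEM);

        err = init_event_parts(event);          /* hypothetical init steps */
        if (err)
                goto err_free;

        return event;

err_free:
        __free_event(event);                    /* same helper _free_event() uses */
        return ERR_PTR(err);
}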
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ravi Bangoria <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
9ec84f79 |
| 23-Dec-2024 |
Luo Gengkun <[email protected]> |
perf: Remove unnecessary parameter of security check
It seems that the attr parameter has never been used in security checks since it was first introduced by:
commit da97e18458fb ("perf_event: Add support for LSM and SELinux checks")
so remove it.
Signed-off-by: Luo Gengkun <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Signed-off-by: Paul Moore <[email protected]>
|
| #
8aeacf25 |
| 18-Feb-2025 |
Joel Granados <[email protected]> |
perf/core: Move perf_event sysctls into kernel/events
Move ctl tables to two files:
- perf_event_{paranoid,mlock_kb,max_sample_rate} and perf_cpu_time_max_percent into kernel/events/core.c
- perf_event_max_{stack,context_per_stack} into kernel/events/callchain.c
Make static the variables and functions that are fully contained in core.c and callchain.c, and remove them from include/linux/perf_event.h. Additionally, six_hundred_forty_kb is moved to callchain.c.
Two new sysctl tables are added ({callchain,events_core}_sysctl_table) with their respective sysctl registration functions.
This is part of a greater effort to move ctl tables into their respective subsystems, which will reduce the merge conflicts in kernel/sysctl.c.
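A hedged sketch of the per-subsystem registration pattern this describes; the table entry shown is illustrative rather than the exact set moved by the patch:

static struct ctl_table events_core_sysctl_table[] = {
        {
                .procname       = "perf_event_paranoid",
                .data           = &sysctl_perf_event_paranoid,
                .maxlen         = sizeof(sysctl_perf_event_paranoid),
                .mode           = 0644,
                .proc_handler   = proc_dointvec,
        },
};

static int __init init_events_core_sysctls(void)
{
        /* Register under "kernel/" from the subsystem itself. */
        register_sysctl_init("kernel", events_core_sysctl_table);
        return 0;
}
core_initcall(init_events_core_sysctls);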
Signed-off-by: Joel Granados <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
8ce939a0 |
| 21-Jan-2025 |
Peter Zijlstra (Intel) <[email protected]> |
perf: Avoid the read if the count is already updated
The event may have been updated in the PMU-specific implementation, e.g., Intel PEBS counters snapshotting. The common code should not read and overwrite the value.
The PERF_SAMPLE_READ in the data->sample_type can be used to detect whether the PMU-specific value is available. If yes, avoid the pmu->read() in the common code. Add a new flag, skip_read, to track the case.
Factor out a perf_pmu_read() to clean up the code.
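A hedged sketch combining the two pieces described above, the factored-out helper and the skip; the skip_read plumbing is simplified to a plain argument here:

static void perf_pmu_read(struct perf_event *event)
{
        if (event->state == PERF_EVENT_STATE_ACTIVE)
                event->pmu->read(event);
}

static void update_sample_count(struct perf_event *event, bool skip_read)
{
        if (skip_read)                  /* PMU already updated the value */
                return;
        perf_pmu_read(event);           /* otherwise read it here */
}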
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
| #
6057b90e |
| 03-Dec-2024 |
Namhyung Kim <[email protected]> |
perf/core: Export perf_exclude_event()
While at it, rename the same function in s390 cpum_sf PMU.
Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Ravi Bangoria <[email protected]> Reviewed-by: Ravi Bangoria <[email protected]> Acked-by: Thomas Richter <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
|
Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1 |
|
| #
faac6f10 |
| 15-May-2024 |
Yabin Cui <[email protected]> |
perf/core: Check sample_type in perf_sample_save_brstack
Check sample_type in perf_sample_save_brstack() to prevent saving branch stack data when it isn't required.
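A hedged sketch of the guard being added; the helper's real signature lives in include/linux/perf_event.h and may differ, the early return is the point:

static inline void perf_sample_save_brstack(struct perf_sample_data *data,
                                            struct perf_event *event,
                                            struct perf_branch_stack *brs,
                                            u64 *brs_cntr)
{
        if (!(event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK))
                return;         /* the caller didn't ask for branch stacks */

        /* ... existing code that copies @brs into @data ... */
}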
Suggested-by: Namhyung Kim <[email protected]> Signed-off-by: Yabin Cui <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Acked-by: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
f226805b |
| 15-May-2024 |
Yabin Cui <[email protected]> |
perf/core: Check sample_type in perf_sample_save_callchain
Check sample_type in perf_sample_save_callchain() to prevent saving callchain data when it isn't required.
Suggested-by: Namhyung Kim <[email protected]> Signed-off-by: Yabin Cui <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Acked-by: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
b9c44b91 |
| 15-May-2024 |
Yabin Cui <[email protected]> |
perf/core: Save raw sample data conditionally based on sample type
Currently, space for raw sample data is always allocated within sample records for both BPF output and tracepoint events. This leads to unused space in sample records when raw sample data is not requested.
This patch enforces checking sample type of an event in perf_sample_save_raw_data(). So raw sample data will only be saved if explicitly requested, reducing overhead when it is not needed.
Fixes: 0a9081cf0a11 ("perf/core: Add perf_sample_save_raw_data() helper") Signed-off-by: Yabin Cui <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Acked-by: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
2c47e7a7 |
| 13-Nov-2024 |
Colton Lewis <[email protected]> |
perf/core: Correct perf sampling with guest VMs
Previously any PMU overflow interrupt that fired while a VCPU was loaded was recorded as a guest event whether it truly was or not. This resulted in nonsense perf recordings that did not honor perf_event_attr.exclude_guest and recorded guest IPs where it should have recorded host IPs.
Rework the sampling logic to only record guest samples for events with exclude_guest = 0. This way any host-only events with exclude_guest set will never see unexpected guest samples. The behaviour of events with exclude_guest = 0 is unchanged.
Note that events configured to sample both host and guest may still misattribute a PMI that arrived in the host as a guest event depending on KVM arch and vendor behavior.
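A hedged sketch of the attribution rule described above; the helper name is illustrative:

static bool sample_is_guest(struct perf_event *event)
{
        /* Only a guest sample if a guest is running AND the event allows it. */
        return perf_guest_state() && !event->attr.exclude_guest;
}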
Signed-off-by: Colton Lewis <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Acked-by: Mark Rutland <[email protected]> Acked-by: Kan Liang <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
04782e63 |
| 13-Nov-2024 |
Colton Lewis <[email protected]> |
perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()
For clarity, rename the arch-specific definitions of these functions to perf_arch_* to denote they are arch-specific. Define the generic-named functions in one place where they can call the arch-specific ones as needed.
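A hedged sketch of the resulting split; the exact parameter list of the generic function may differ (the guest-VM rework above also threads an event argument through):

unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
        /* One generic definition, deferring to the arch hook. */
        return perf_arch_instruction_pointer(regs);
}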
Signed-off-by: Colton Lewis <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Acked-by: Thomas Richter <[email protected]> Acked-by: Mark Rutland <[email protected]> Acked-by: Madhavan Srinivasan <[email protected]> Acked-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
18d92bb5 |
| 22-Oct-2024 |
Adrian Hunter <[email protected]> |
perf/core: Add aux_pause, aux_resume, aux_start_paused
Hardware traces, such as instruction traces, can produce a vast amount of trace data, so being able to reduce tracing to more specific circumstances can be useful.
The ability to pause or resume tracing when another event happens, can do that.
Add ability for an event to "pause" or "resume" AUX area tracing.
Add aux_pause bit to perf_event_attr to indicate that, if the event happens, the associated AUX area tracing should be paused. Ditto aux_resume. Do not allow aux_pause and aux_resume to be set together.
Add aux_start_paused bit to perf_event_attr to indicate to an AUX area event that it should start in a "paused" state.
Add aux_paused to struct hw_perf_event for AUX area events to keep track of the "paused" state. aux_paused is initialized to aux_start_paused.
Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start() callbacks. Call as needed, during __perf_event_output(). Add aux_in_pause_resume to struct perf_buffer to prevent races with the NMI handler. Pause/resume in NMI context will miss out if it coincides with another pause/resume.
To use aux_pause or aux_resume, an event must be in a group with the AUX area event as the group leader.
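A hedged userspace sketch of the attr bits described above, complementing the perf record example below: the AUX-area leader starts paused and a sampling event in its group resumes tracing when it fires (PMU type lookup and error handling omitted):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

int open_paused_trace_group(__u32 aux_pmu_type, __u64 trigger_config)
{
        struct perf_event_attr aux = { 0 }, trig = { 0 };

        aux.size = sizeof(aux);
        aux.type = aux_pmu_type;                /* e.g. the intel_pt PMU */
        aux.aux_start_paused = 1;               /* start with tracing paused */
        int leader = syscall(SYS_perf_event_open, &aux, 0, -1, -1, 0);

        trig.size = sizeof(trig);
        trig.type = PERF_TYPE_TRACEPOINT;
        trig.config = trigger_config;           /* tracepoint id from tracefs */
        trig.sample_period = 1;
        trig.aux_resume = 1;                    /* resume AUX tracing on this event */
        return syscall(SYS_perf_event_open, &trig, 0, -1, leader, 0);
}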
Example (requires Intel PT and tools patches also):
$ perf record --kcore -e intel_pt/aux-action=start-paused/k,syscalls:sys_enter_newuname/aux-action=resume/,syscalls:sys_exit_newuname/aux-action=pause/ uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data ]
$ perf script --call-trace
uname 30805 [000] 24001.058782799: name: 0x7ffc9c1865b0
uname 30805 [000] 24001.058784424: psb offs: 0
uname 30805 [000] 24001.058784424: cbr: 39 freq: 3904 MHz (139%)
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __x64_sys_newuname
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) down_read
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __cond_resched
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) in_lock_functions
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_sub
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) up_read
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) in_lock_functions
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) preempt_count_sub
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) _copy_to_user
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_to_user_mode
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_work
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) perf_syscall_exit
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_alloc
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_get_recursion_context
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_tp_event
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_update
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) tracing_gen_ctx_irq_test
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_event
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __perf_event_account_interrupt
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __this_cpu_preempt_check
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_output_forward
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_aux_pause
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) ring_buffer_get
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_lock
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_unlock
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) pt_event_stop
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) native_write_msr
uname 30805 [000] 24001.058785463: ([kernel.kallsyms]) native_write_msr
uname 30805 [000] 24001.058785639: 0x0
Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: James Clark <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
| #
a48a36b3 |
| 02-Aug-2024 |
Kan Liang <[email protected]> |
perf: Add PERF_EV_CAP_READ_SCOPE
Usually, an event can be read from any CPU of the scope. It doesn't need to be read from the advertised CPU.
Add a new event cap, PERF_EV_CAP_READ_SCOPE. An event of a PMU with scope can be read from any active CPU in the scope.
Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
4ba4f1af |
| 02-Aug-2024 |
Kan Liang <[email protected]> |
perf: Generic hotplug support for a PMU with a scope
The perf subsystem assumes that the counters of a PMU are per-CPU. So the user space tool reads a counter from each CPU in the system wide mode. However, many PMUs don't have a per-CPU counter. The counter is effective for a scope, e.g., a die or a socket. To address this, a cpumask is exposed by the kernel driver to restrict to one CPU to stand for a specific scope. In case the given CPU is removed, the hotplug support has to be implemented for each such driver.
The code to support the cpumask and hotplug is very similar across drivers:
- Expose a cpumask into sysfs
- Pick up another CPU in the same scope if the given CPU is removed.
- Invoke perf_pmu_migrate_context() to migrate to a new CPU.
- In event init, always set the CPU in the cpumask to event->cpu
Similar duplicated code is implemented in each such PMU driver. It would be good to introduce a generic infrastructure to avoid such duplication.
Five popular scopes are implemented here: core, die, cluster, pkg, and system-wide. The scope can be set when a PMU is registered. If so, a "cpumask" is automatically exposed for the PMU.
The "cpumask" comes from the perf_online_<scope>_mask, which tracks the active CPUs for each scope. They are set when the first CPU of the scope comes online via the generic perf hotplug support. When a corresponding CPU is removed, the perf_online_<scope>_mask is updated accordingly and the PMU will be moved to a new CPU from the same scope if possible.
Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
| #
5e9629d0 |
| 27-Aug-2024 |
James Clark <[email protected]> |
drivers/perf: arm_spe: Use perf_allow_kernel() for permissions
Use perf_allow_kernel() for 'pa_enable' (physical addresses), 'pct_enable' (physical timestamps) and context IDs. This means that perf_event_paranoid is now taken into account and LSM hooks can be used, which is more consistent with other perf_event_open calls. For example PERF_SAMPLE_PHYS_ADDR uses perf_allow_kernel() rather than just perfmon_capable().
This also indirectly fixes the following error message which is misleading because perf_event_paranoid is not taken into account by perfmon_capable():
$ perf record -e arm_spe/pa_enable/
Error: Access to performance monitoring and observability operations is limited. Consider adjusting /proc/sys/kernel/perf_event_paranoid setting ...
Suggested-by: Al Grant <[email protected]> Signed-off-by: James Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Will Deacon <[email protected]>
|
| #
7e8b2556 |
| 30-Jul-2024 |
Ben Gainey <[email protected]> |
perf: Support PERF_SAMPLE_READ with inherit
This change allows events to use PERF_SAMPLE_READ with inherit so long as PERF_SAMPLE_TID is also set. This enables sample based profiling of a group of counters over a hierarchy of processes or threads. This is useful, for example, for collecting per-thread counters/metrics, event based sampling of multiple counters as a unit, access to the enabled and running time when using multiplexing and so on.
Prior to this, users were restricted to either collecting aggregate statistics for a multi-threaded/-process application (e.g. with "perf stat"), or to sample individual threads, or to profile the entire system (which requires root or CAP_PERFMON, and may produce much more data than is required). Theoretically a tool could poll for or otherwise monitor thread/process creation and construct whatever events the user is interested in using perf_event_open, for each new thread or process, but this is racy, can lead to file-descriptor exhaustion, and ultimately just replicates the behaviour of inherit, but in userspace.
This configuration differs from inherit without PERF_SAMPLE_READ in that the accumulated event count, and consequently any sample (such as if triggered by overflow of sample_period) will be on a per-thread rather than on an aggregate basis.
The meaning of read_format::value field of both PERF_RECORD_READ and PERF_RECORD_SAMPLE is changed such that if the sampled event uses this new configuration then the values reported will be per-thread rather than the global aggregate value. This is a change from the existing semantics of read_format (where PERF_SAMPLE_READ is used without inherit), but it is necessary to expose the per-thread counter values, and it avoids reinventing a separate "read_format_thread" field that otherwise replicates the same behaviour. This change should not break existing tools, since this configuration was not previously valid and was rejected by the kernel. Tools that opt into this new mode will need to account for this when calculating the counter delta for a given sample. Tools that wish to have both the per-thread and aggregate value can perform the global aggregation themselves from the per-thread values.
The change to read_format::value does not affect existing valid perf_event_attr configurations, nor does it change the behaviour of calls to "read" on an event descriptor. Both continue to report the aggregate value for the entire thread/process hierarchy. The difference between the results reported by "read" and PERF_RECORD_SAMPLE in this new configuration is justified on the basis that it is not (easily) possible for "read" to target a specific thread (the caller only has the fd for the original parent event).
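A hedged userspace sketch of the newly-allowed combination; with this configuration the read_format values in PERF_RECORD_SAMPLE are per-thread as described above (error handling omitted):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int open_inherited_read_sampler(pid_t pid)
{
        struct perf_event_attr attr = { 0 };

        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 1000000;
        attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_READ; /* TID is required */
        attr.read_format = PERF_FORMAT_ID;
        attr.inherit = 1;       /* previously rejected together with PERF_SAMPLE_READ */

        return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}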
Signed-off-by: Ben Gainey <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|
| #
79bd2330 |
| 30-Jul-2024 |
Ben Gainey <[email protected]> |
perf: Rename perf_event_context.nr_pending to nr_no_switch_fast.
nr_pending counts the number of events in the context that have either pending_sigtrap or pending_work set, but it is used to prevent taking the fast path in perf_event_context_sched_out.
Renamed to reflect what it is used for, rather than what it counts. This change allows using the field to track other event properties that also require skipping the fast path without possible confusion over the name.
Signed-off-by: Ben Gainey <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
|