History log of /linux-6.15/include/linux/perf_event.h (Results 1 – 25 of 468)
Revision Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6
# 56799bc0 04-Mar-2025 Frederic Weisbecker <[email protected]>

perf: Fix hang while freeing sigtrap event

Perf can hang while freeing a sigtrap event if a related deferred
signal hadn't managed to be sent before the file got closed:

perf_event_overflow()
task_work_add(perf_pending_task)

fput()
task_work_add(____fput())

task_work_run()
____fput()
perf_release()
perf_event_release_kernel()
_free_event()
perf_pending_task_sync()
task_work_cancel() -> FAILED
rcuwait_wait_event()

Once task_work_run() is running, the list of pending callbacks is
removed from the task_struct and from this point on task_work_cancel()
can't remove any pending and not yet started work items, hence the
task_work_cancel() failure and the hang on rcuwait_wait_event().

Task work could be changed to remove one work at a time, so a work
running on the current task can always cancel a pending one; however,
the wait / wake design is still subject to inverted dependencies when
remote targets are involved, as pictured by Oleg:

T1 T2

fd = perf_event_open(pid => T2->pid); fd = perf_event_open(pid => T1->pid);
close(fd) close(fd)
<IRQ> <IRQ>
perf_event_overflow() perf_event_overflow()
task_work_add(perf_pending_task) task_work_add(perf_pending_task)
</IRQ> </IRQ>
fput() fput()
task_work_add(____fput()) task_work_add(____fput())

task_work_run() task_work_run()
____fput() ____fput()
perf_release() perf_release()
perf_event_release_kernel() perf_event_release_kernel()
_free_event() _free_event()
perf_pending_task_sync() perf_pending_task_sync()
rcuwait_wait_event() rcuwait_wait_event()

Therefore the only option left is to acquire the event reference count
upon queueing the perf task work and release it from the task work, just
like it was done before 3a5465418f5f ("perf: Fix event leak upon exec and file release")
but without the leaks it fixed.

Some adjustments are necessary to make it work:

* A child event might dereference its parent upon freeing. Care must be
taken to release the parent last.

* Some places that assume the event has no reference held, and can
therefore be freed right away, must instead put the reference and
let the reference counting do its job.

Reported-by: "Yi Lai" <[email protected]>
Closes: https://lore.kernel.org/all/Zx9Losv4YcJowaP%2F@ly-workstation/
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: 3a5465418f5f ("perf: Fix event leak upon exec and file release")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
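
A minimal sketch of the resulting scheme (illustrative, not the kernel's
exact code; perf_sigtrap_queue is a hypothetical name): the reference is
taken when the task work is queued and dropped from the work itself, so
_free_event() never has to wait on a work item that task_work_cancel()
can no longer reach.

  static void perf_sigtrap_queue(struct perf_event *event)
  {
      atomic_long_inc(&event->refcount);  /* pin the event for the work */
      if (task_work_add(current, &event->pending_task, TWA_RESUME))
          put_event(event);               /* queueing failed, unpin */
  }

  static void perf_pending_task(struct callback_head *head)
  {
      struct perf_event *event;

      event = container_of(head, struct perf_event, pending_task);
      perf_sigtrap(event);  /* deliver the deferred signal */
      put_event(event);     /* may free; a child releases its parent last */
  }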

# 12e766d1 17-Mar-2025 Peter Zijlstra <[email protected]>

perf: Fix __percpu annotation

With bcecd5a529c1 ("percpu: repurpose __percpu tag as a named address
space qualifier") the normal compilers start caring about the __percpu
annotation, as such f67d1ffd841f ("perf/core: Detach 'struct
perf_cpu_pmu_context' and 'struct pmu' lifetimes") needs a fixup.

Fixes: f67d1ffd841f ("perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes")
Fixes: bcecd5a529c1 ("percpu: repurpose __percpu tag as a named address space qualifier")
Reported-by: Stephen Rothwell <[email protected]>
Reported-by: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
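
For illustration, a minimal example (generic names, not the actual perf
fix) of the declaration style the stricter checking now enforces:

  #include <linux/init.h>
  #include <linux/percpu.h>

  struct foo_stats { u64 count; };

  /* The pointer holding the result of alloc_percpu() must carry the
   * __percpu qualifier, or the named-address-space check complains. */
  static struct foo_stats __percpu *stats;

  static int __init foo_init(void)
  {
      stats = alloc_percpu(struct foo_stats);
      if (!stats)
          return -ENOMEM;
      this_cpu_inc(stats->count);  /* accesses this CPU's copy */
      return 0;
  }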

# bd2da08d 14-Mar-2025 Kan Liang <[email protected]>

perf: Clean up pmu specific data

The pmu specific data is saved in task_struct now. Remove it from event
context structure.

Remove swap_task_ctx() as well.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# d57e94f5 14-Mar-2025 Kan Liang <[email protected]>

perf: Supply task information to sched_task()

To save/restore LBR call stack data in system-wide mode, the task_struct
information is required.

Extend the parameters of sched_task() to supply task_struct information.

When scheduling in, the LBR call stack data for the new task will be
restored. When scheduling out, the LBR call stack data for the old task
will be saved. Only the required task_struct information needs to be
passed.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
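
The extended callback looks roughly like this (prototype per the commit
description):

  /* sched_in == true:  restore the LBR call stack of @task.
   * sched_in == false: save the LBR call stack of @task. */
  void (*sched_task)(struct perf_event_pmu_context *pmu_ctx,
                     struct task_struct *task, bool sched_in);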

# 506e64e7 14-Mar-2025 Kan Liang <[email protected]>

perf: attach/detach PMU specific data

The LBR call stack data has to be saved/restored during context switch
to fix the issue of shorter LBR call stacks in system-wide mode.
Allocate PMU specific data and attach them to the corresponding
task_struct during LBR call stack monitoring.

When a LBR call stack event is accounted, the perf_ctx_data for the
related tasks will be allocated/attached by attach_perf_ctx_data().
When a LBR call stack event is unaccounted, the perf_ctx_data for
related tasks will be detached/freed by detach_perf_ctx_data().

The LBR call stack event could be a per-task event or a system-wide
event.
- For a per-task event, perf only allocates the perf_ctx_data for the
current task. If the allocation fails, perf will error out.
- For a system-wide event, perf has to allocate the perf_ctx_data for
both the existing tasks and the upcoming tasks.
The allocation for the existing tasks is done in perf_event_alloc().
If any allocation fails, perf will error out.
The allocation for the new tasks will be done in perf_event_fork().
A global reader/writer semaphore, global_ctx_data_rwsem, is added to
address the global race.
- The perf_ctx_data is only freed by the last LBR call stack event.
The number of per-task events is tracked by the refcount of each task.
Since the system-wide events impact all tasks, it's not practical to
go through the whole task list to update the refcount for each
system-wide event. The number of system-wide events is tracked by a
global variable global_ctx_data_ref.

Suggested-by: "Peter Zijlstra (Intel)" <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
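
A minimal model of the accounting described above (illustrative, not the
kernel code; the helper names are hypothetical): per-task users hold a
refcount on their task's perf_ctx_data, while system-wide users bump the
single global_ctx_data_ref under global_ctx_data_rwsem so that fork()
cannot race with event accounting.

  static DECLARE_RWSEM(global_ctx_data_rwsem);
  static refcount_t global_ctx_data_ref;

  static void attach_system_wide_ctx_data(void)
  {
      /* Fast path: another system-wide event already attached it. */
      if (refcount_inc_not_zero(&global_ctx_data_ref))
          return;

      down_write(&global_ctx_data_rwsem);
      /* ... allocate perf_ctx_data for every existing task ... */
      refcount_set(&global_ctx_data_ref, 1);
      up_write(&global_ctx_data_rwsem);
  }

  static void perf_event_fork_ctx_data(struct task_struct *child)
  {
      /* New tasks arriving while a system-wide event exists. */
      down_read(&global_ctx_data_rwsem);
      if (refcount_read(&global_ctx_data_ref))
          alloc_task_ctx_data(child);  /* hypothetical helper */
      up_read(&global_ctx_data_rwsem);
  }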

# cb436912 14-Mar-2025 Kan Liang <[email protected]>

perf: Save PMU specific data in task_struct

Some PMU specific data has to be saved/restored during context switch,
e.g. LBR call stack data. Currently, the data is saved in the event
context structure, but only for per-process events. For system-wide
events, because the LBR call stack data is missing after a context
switch, LBR call stacks are always shorter in comparison to per-process
mode.

For example,
Per-process mode:
$perf record --call-graph lbr -- taskset -c 0 ./tchain_edit

- 99.90% 99.86% tchain_edit tchain_edit [.] f3
99.86% _start
__libc_start_main
generic_start_main
main
f1
- f2
f3

System-wide mode:
$perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit

- 99.88% 99.82% tchain_edit tchain_edit [.] f3
- 62.02% main
f1
f2
f3
- 28.83% f1
- f2
f3
- 28.83% f1
- f2
f3
- 8.88% generic_start_main
main
f1
f2
f3

It isn't practical to simply allocate the data for system-wide events
in the CPU context structure for all tasks. We have no idea which CPU a
task will be scheduled to, so the duplicated LBR data would have to be
maintained in every CPU context structure. That's a huge waste.
Otherwise, the LBR data is still lost if the task is scheduled to
another CPU.

Save the pmu specific data in task_struct. The size of the pmu specific
data is 788 bytes for the LBR call stack. Usually, the overall number of
threads doesn't exceed a few thousand. For 10K threads, keeping the LBR
data would consume an additional ~8MB (788 bytes x 10,000 ≈ 7.9MB). The
additional space will only be allocated
during LBR call stack monitoring. It will be released when the
monitoring is finished.

Furthermore, moving task_ctx_data from perf_event_context to task_struct
can reduce complexity and make things clearer. E.g. perf doesn't need to
swap task_ctx_data on optimized context switch path.
This patch set is just the first step. There could be other
optimization/extension on top of this patch set. E.g. for cgroup
profiling, perf just needs to save/store the LBR call stack information
for tasks in specific cgroup. That could reduce the additional space.
Also, the LBR call stack can be available for software events, or allow
even debugging use cases, like LBRs on crash later.

Because of the alignment requirement of Intel Arch LBR, a kmem_cache
is used to allocate the PMU specific data. It's required when a child
task allocates the space. Save the cache pointer in struct perf_ctx_data.
The refcount in struct perf_ctx_data is used to track the users of pmu
specific data.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexey Budankov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
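
The container described above looks roughly like this (field set per the
commit text; check the tree for the authoritative definition):

  struct perf_ctx_data {
      struct rcu_head    rcu_head;   /* defers the free past readers */
      refcount_t         refcount;   /* per-task users of the data */
      int                global;     /* attached by a system-wide event */
      struct kmem_cache  *ctx_cache; /* honors the Arch LBR alignment */
      void               *data;      /* the PMU specific payload */
  };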

# c53e14f1 10-Mar-2025 Kan Liang <[email protected]>

perf: Extend per event callchain limit to branch stack

The commit 97c79a38cd45 ("perf core: Per event callchain limit")
introduced a per-event term to allow finer tuning of the depth of
callchains to save space.

It should be applied to the branch stack as well. For example, autoFDO
collections require the maximum number of LBR entries, while other
system-wide LBR users may only be interested in the most recent few
LBRs. A per-event LBR depth saves space in the perf output buffer.

The patch simply drops the uninteresting branches, but HW still
collects the maximum number of branches. There may be a model-specific
optimization that can reduce the HW depth for some cases to reduce the
overhead further, but it isn't included in this patch set because it's
not useful for all cases. For example, Arch LBR can utilize PEBS and
XSAVE to collect LBRs, so the depth should have less impact on the
collection overhead there. The model-specific optimization may be
implemented separately later.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
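
A sketch of the trimming described above (names illustrative): the
hardware still records its full depth, and perf merely caps how many
entries reach the sample.

  static void trim_branch_stack(struct perf_branch_stack *br, u16 limit)
  {
      /* entries[0] is the most recent branch, so truncating the
       * count keeps the newest @limit branches. */
      if (limit && br->nr > limit)
          br->nr = limit;
  }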

Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7
# 4eabf533 04-Nov-2024 Peter Zijlstra <[email protected]>

perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes

In preparation for being able to unregister a PMU with existing events,
it becomes important to detach struct perf_cpu_pmu_context lifetimes
from that of struct pmu.

Notably struct perf_cpu_pmu_context embeds a struct perf_event_pmu_context
that can stay referenced until the last event goes.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 4baeb068 04-Nov-2024 Peter Zijlstra <[email protected]>

perf/core: Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_count

Because it makes no sense to have two per-cpu allocations per pmu.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# c70ca298 04-Nov-2024 Peter Zijlstra <[email protected]>

perf/core: Simplify the perf_event_alloc() error path

The error cleanup sequence in perf_event_alloc() is a subset of the
existing _free_event() function (it must of course be).

Split this out into __free_event() and simplify the error path.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
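
The resulting shape is roughly the following (illustrative; init_parts()
is a hypothetical stand-in for the real setup steps):

  static struct perf_event *perf_event_alloc_sketch(void)
  {
      struct perf_event *event;
      int err;

      event = kzalloc(sizeof(*event), GFP_KERNEL);
      if (!event)
          return ERR_PTR(-ENOMEM);

      err = init_parts(event);  /* hypothetical setup helper */
      if (err) {
          /* Single exit path: __free_event() undoes exactly the
           * subset of _free_event() that applies at this point. */
          __free_event(event);
          return ERR_PTR(err);
      }
      return event;
  }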

# 9ec84f79 23-Dec-2024 Luo Gengkun <[email protected]>

perf: Remove unnecessary parameter of security check

It seems that the attr parameter has never been used in security
checks since it was first introduced by:

commit da97e18458fb ("perf_event: Add support for LSM and SELinux checks")

so remove it.

Signed-off-by: Luo Gengkun <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
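
The change amounts to dropping the unused argument from the hook and its
callers, roughly:

  /* Before (sketch): */
  int security_perf_event_open(struct perf_event_attr *attr, int type);

  /* After (sketch): */
  int security_perf_event_open(int type);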

# 8aeacf25 18-Feb-2025 Joel Granados <[email protected]>

perf/core: Move perf_event sysctls into kernel/events

Move ctl tables to two files:

- perf_event_{paranoid,mlock_kb,max_sample_rate} and
perf_cpu_time_max_percent into kernel/events/core.c

- perf_event_max_{stack,context_per_stack} into
kernel/events/callchain.c

Make static the variables and functions that are fully contained in
core.c and callchain.c, and remove them from include/linux/perf_event.h.
Additionally, six_hundred_forty_kb is moved to callchain.c.

Two new sysctl tables are added ({callchain,events_core}_sysctl_table)
with their respective sysctl registration functions.

This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.

Signed-off-by: Joel Granados <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
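
The registration pattern, sketched with a single entry (illustrative;
the real tables carry the full set of knobs listed above and their
proper handlers):

  static struct ctl_table events_core_sysctl_table[] = {
      {
          .procname     = "perf_event_paranoid",
          .data         = &sysctl_perf_event_paranoid,
          .maxlen       = sizeof(sysctl_perf_event_paranoid),
          .mode         = 0644,
          .proc_handler = proc_dointvec,
      },
  };

  static int __init init_events_core_sysctls(void)
  {
      register_sysctl_init("kernel", events_core_sysctl_table);
      return 0;
  }
  core_initcall(init_events_core_sysctls);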

# 8ce939a0 21-Jan-2025 Peter Zijlstra (Intel) <[email protected]>

perf: Avoid the read if the count is already updated

The event may have been updated in the PMU-specific implementation,
e.g., Intel PEBS counters snapshotting. The common code should not
read and overwrite the value.

The PERF_SAMPLE_READ bit in data->sample_type can be used to detect
whether the PMU-specific value is available. If so, avoid the
pmu->read() in the common code. Add a new flag, skip_read, to track the
case.

Factor out a perf_pmu_read() to clean up the code.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
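
In outline (a sketch; where the skip_read flag actually lives is a
detail of the real patch, it is shown on the sample data here purely
for illustration):

  static void perf_pmu_read(struct perf_event *event)
  {
      if (event->state == PERF_EVENT_STATE_ACTIVE)
          event->pmu->read(event);
  }

  /* At the sampling site: */
  if (!data->skip_read)  /* PMU already supplied the count */
      perf_pmu_read(event);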

# 6057b90e 03-Dec-2024 Namhyung Kim <[email protected]>

perf/core: Export perf_exclude_event()

While at it, rename the same function in s390 cpum_sf PMU.

Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Ravi Bangoria <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Acked-by: Thomas Richter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1
# faac6f10 15-May-2024 Yabin Cui <[email protected]>

perf/core: Check sample_type in perf_sample_save_brstack

Check sample_type in perf_sample_save_brstack() to prevent
saving branch stack data when it isn't required.

Suggested-by: Namhyung Kim <[email protected]>
Signed-off-by: Yabin Cui <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
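
The guard amounts to an early return when the bit isn't set (sketch,
signature abbreviated; the neighbouring commits apply the same pattern
to the callchain and raw-data helpers):

  static void perf_sample_save_brstack(struct perf_sample_data *data,
                                       struct perf_event *event,
                                       struct perf_branch_stack *brs)
  {
      if (!(event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK))
          return;
      /* ... save @brs into @data and mark the sample accordingly ... */
  }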

# f226805b 15-May-2024 Yabin Cui <[email protected]>

perf/core: Check sample_type in perf_sample_save_callchain

Check sample_type in perf_sample_save_callchain() to prevent
saving callchain data when it isn't required.

Suggested-by: Namhyung Kim <[email protected]>
Signed-off-by: Yabin Cui <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# b9c44b91 15-May-2024 Yabin Cui <[email protected]>

perf/core: Save raw sample data conditionally based on sample type

Currently, space for raw sample data is always allocated within sample
records for both BPF output and tracepoint events. This leads to unused
space in sample records when raw sample data is not requested.

This patch enforces checking the sample type of an event in
perf_sample_save_raw_data(), so raw sample data will only be saved if
explicitly requested, reducing overhead when it is not needed.

Fixes: 0a9081cf0a11 ("perf/core: Add perf_sample_save_raw_data() helper")
Signed-off-by: Yabin Cui <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 2c47e7a7 13-Nov-2024 Colton Lewis <[email protected]>

perf/core: Correct perf sampling with guest VMs

Previously any PMU overflow interrupt that fired while a VCPU was
loaded was recorded as a guest event whether it truly was or not. This
resulted in nonsense perf recordings that did not honor
perf_event_attr.exclude_guest and recorded guest IPs where it should
have recorded host IPs.

Rework the sampling logic to only record guest samples for events with
exclude_guest = 0. This way any host-only events with exclude_guest
set will never see unexpected guest samples. The behaviour of events
with exclude_guest = 0 is unchanged.

Note that events configured to sample both host and guest may still
misattribute a PMI that arrived in the host as a guest event depending
on KVM arch and vendor behavior.

Signed-off-by: Colton Lewis <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
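
The core of the decision, sketched with the existing perf_guest_state()
helper:

  static bool sample_is_guest(struct perf_event *event)
  {
      /* Attribute the PMI to the guest only if a guest is actually
       * running and the event did not ask to exclude guests. */
      return (perf_guest_state() & PERF_GUEST_ACTIVE) &&
             !event->attr.exclude_guest;
  }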

# 04782e63 13-Nov-2024 Colton Lewis <[email protected]>

perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()

For clarity, rename the arch-specific definitions of these functions
to perf_arch_* to denote they are arch-specific. Define the
generic-named functions in one place where they can call the
arch-specific ones as needed.

Signed-off-by: Colton Lewis <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Acked-by: Thomas Richter <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Madhavan Srinivasan <[email protected]>
Acked-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
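
The shape of the hoisted function, roughly (illustrative; the guest
handling shown reflects the companion commit above):

  unsigned long perf_instruction_pointer(struct perf_event *event,
                                         struct pt_regs *regs)
  {
      /* Generic policy lives here; only the register access remains
       * arch-specific. */
      if ((perf_guest_state() & PERF_GUEST_ACTIVE) &&
          !event->attr.exclude_guest)
          return perf_guest_get_ip();
      return perf_arch_instruction_pointer(regs);
  }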

# 18d92bb5 22-Oct-2024 Adrian Hunter <[email protected]>

perf/core: Add aux_pause, aux_resume, aux_start_paused

Hardware traces, such as instruction traces, can produce a vast amount of
trace data, so being able to reduce tracing to more specific circumstances
can be useful.

The ability to pause or resume tracing when another event happens can do
that.

Add ability for an event to "pause" or "resume" AUX area tracing.

Add aux_pause bit to perf_event_attr to indicate that, if the event
happens, the associated AUX area tracing should be paused. Ditto
aux_resume. Do not allow aux_pause and aux_resume to be set together.

Add aux_start_paused bit to perf_event_attr to indicate to an AUX area
event that it should start in a "paused" state.

Add aux_paused to struct hw_perf_event for AUX area events to keep track of
the "paused" state. aux_paused is initialized to aux_start_paused.

Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start()
callbacks. Call as needed, during __perf_event_output(). Add
aux_in_pause_resume to struct perf_buffer to prevent races with the NMI
handler. Pause/resume in NMI context will miss out if it coincides with
another pause/resume.

To use aux_pause or aux_resume, an event must be in a group with the AUX
area event as the group leader.

Example (requires Intel PT and tools patches also):

$ perf record --kcore -e intel_pt/aux-action=start-paused/k,syscalls:sys_enter_newuname/aux-action=resume/,syscalls:sys_exit_newuname/aux-action=pause/ uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data ]
$ perf script --call-trace
uname 30805 [000] 24001.058782799: name: 0x7ffc9c1865b0
uname 30805 [000] 24001.058784424: psb offs: 0
uname 30805 [000] 24001.058784424: cbr: 39 freq: 3904 MHz (139%)
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __x64_sys_newuname
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) down_read
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __cond_resched
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) in_lock_functions
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_sub
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) up_read
uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) in_lock_functions
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) preempt_count_sub
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) _copy_to_user
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_to_user_mode
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_work
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) perf_syscall_exit
uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_alloc
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_get_recursion_context
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_tp_event
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_update
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) tracing_gen_ctx_irq_test
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_event
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __perf_event_account_interrupt
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __this_cpu_preempt_check
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_output_forward
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_aux_pause
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) ring_buffer_get
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_lock
uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_unlock
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) pt_event_stop
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id
uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) native_write_msr
uname 30805 [000] 24001.058785463: ([kernel.kallsyms]) native_write_msr
uname 30805 [000] 24001.058785639: 0x0

Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: James Clark <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
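
In perf_event_attr terms, the new bits combine like this (userspace
sketch; the AUX event's type/config are PMU-specific and omitted):

  #include <linux/perf_event.h>
  #include <string.h>

  static void setup_aux_pause_attrs(struct perf_event_attr *aux_leader,
                                    struct perf_event_attr *trigger)
  {
      memset(aux_leader, 0, sizeof(*aux_leader));
      aux_leader->size = sizeof(*aux_leader);
      aux_leader->aux_start_paused = 1;  /* trace begins paused */

      /* The trigger must be grouped under the AUX area leader. */
      memset(trigger, 0, sizeof(*trigger));
      trigger->size = sizeof(*trigger);
      trigger->aux_resume = 1;  /* firing this event un-pauses the trace */
      /* aux_pause and aux_resume must not both be set on one event. */
  }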

# a48a36b3 02-Aug-2024 Kan Liang <[email protected]>

perf: Add PERF_EV_CAP_READ_SCOPE

Usually, an event can be read from any CPU of the scope. It doesn't need
to be read from the advertised CPU.

Add a new event cap, PERF_EV_CAP_READ_SCOPE. An event of a PMU with
scope can be read from any active CPU in the scope.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

# 4ba4f1af 02-Aug-2024 Kan Liang <[email protected]>

perf: Generic hotplug support for a PMU with a scope

The perf subsystem assumes that the counters of a PMU are per-CPU. So
the user space tool reads a counter from each CPU in the system wide
mode. However, many PMUs don't have a per-CPU counter. The counter is
effective for a scope, e.g., a die or a socket. To address this, a
cpumask is exposed by the kernel driver to restrict collection to one
CPU that stands for the scope. In case the given CPU is removed,
hotplug support has to be implemented by each such driver.

The code to support the cpumask and hotplug is very similar across
drivers:
- Expose a cpumask into sysfs
- Pick up another CPU in the same scope if the given CPU is removed.
- Invoke perf_pmu_migrate_context() to migrate to a new CPU.
- In event init, always set the CPU in the cpumask to event->cpu

Similar duplicated code is implemented in each such PMU driver. It
would be good to introduce a generic infrastructure to avoid such
duplication.

Five common scopes are implemented here: core, die, cluster, pkg, and
system-wide. The scope can be set when a PMU is registered. If so, a
"cpumask" is automatically exposed for the PMU.

The "cpumask" is from the perf_online_<scope>_mask, which is to track
the active CPU for each scope. They are set when the first CPU of the
scope is online via the generic perf hotplug support. When a
corresponding CPU is removed, the perf_online_<scope>_mask is updated
accordingly and the PMU will be moved to a new CPU from the same scope
if possible.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
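
The per-driver boilerplate this replaces looks roughly like the
following (illustrative names; a real handler would also restrict the
search to online CPUs):

  static int my_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
  {
      struct my_pmu *p = hlist_entry_safe(node, struct my_pmu, node);
      int target;

      if (!cpumask_test_and_clear_cpu(cpu, &p->cpumask))
          return 0;  /* @cpu wasn't the designated reader */

      /* Pick any other CPU in the same scope (here: the die). */
      target = cpumask_any_but(topology_die_cpumask(cpu), cpu);
      if (target < nr_cpu_ids) {
          cpumask_set_cpu(target, &p->cpumask);
          perf_pmu_migrate_context(&p->pmu, cpu, target);
      }
      return 0;
  }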

# 5e9629d0 27-Aug-2024 James Clark <[email protected]>

drivers/perf: arm_spe: Use perf_allow_kernel() for permissions

Use perf_allow_kernel() for 'pa_enable' (physical addresses),
'pct_enable' (physical timestamps) and context IDs. This means that
perf_event_paranoid is now taken into account and LSM hooks can be used,
which is more consistent with other perf_event_open calls. For example
PERF_SAMPLE_PHYS_ADDR uses perf_allow_kernel() rather than just
perfmon_capable().

This also indirectly fixes the following error message which is
misleading because perf_event_paranoid is not taken into account by
perfmon_capable():

$ perf record -e arm_spe/pa_enable/

Error:
Access to performance monitoring and observability operations is
limited. Consider adjusting /proc/sys/kernel/perf_event_paranoid
setting ...

Suggested-by: Al Grant <[email protected]>
Signed-off-by: James Clark <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Will Deacon <[email protected]>
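
The check itself is a one-liner per privileged option (sketch, using the
driver's ATTR_CFG_GET_FLD accessor; note that a later cleanup, listed
earlier in this log, drops the attr argument from perf_allow_kernel()):

  /* Gate physical addresses on the same policy as other kernel data. */
  if (ATTR_CFG_GET_FLD(&event->attr, pa_enable)) {
      int err = perf_allow_kernel(&event->attr);

      if (err)
          return err;
  }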

# 7e8b2556 30-Jul-2024 Ben Gainey <[email protected]>

perf: Support PERF_SAMPLE_READ with inherit

This change allows events to use PERF_SAMPLE_READ with inherit
so long as PERF_SAMPLE_TID is also set. This enables sample based
profiling of a group of counters over a hierarchy of processes or
threads. This is useful, for example, for collecting per-thread
counters/metrics, event based sampling of multiple counters as a unit,
access to the enabled and running time when using multiplexing and so
on.

Prior to this, users were restricted to either collecting aggregate
statistics for a multi-threaded/-process application (e.g. with
"perf stat"), or to sample individual threads, or to profile the entire
system (which requires root or CAP_PERFMON, and may produce much more
data than is required). Theoretically a tool could poll for or otherwise
monitor thread/process creation and construct whatever events the user
is interested in using perf_event_open, for each new thread or process,
but this is racy, can lead to file-descriptor exhaustion, and ultimately
just replicates the behaviour of inherit, but in userspace.

This configuration differs from inherit without PERF_SAMPLE_READ in that
the accumulated event count, and consequently any sample (such as if
triggered by overflow of sample_period) will be on a per-thread rather
than on an aggregate basis.

The meaning of read_format::value field of both PERF_RECORD_READ and
PERF_RECORD_SAMPLE is changed such that if the sampled event uses this
new configuration then the values reported will be per-thread rather
than the global aggregate value. This is a change from the existing
semantics of read_format (where PERF_SAMPLE_READ is used without
inherit), but it is necessary to expose the per-thread counter values,
and it avoids reinventing a separate "read_format_thread" field that
otherwise replicates the same behaviour. This change should not break
existing tools, since this configuration was not previously valid and
was rejected by the kernel. Tools that opt into this new mode will need
to account for this when calculating the counter delta for a given
sample. Tools that wish to have both the per-thread and aggregate value
can perform the global aggregation themselves from the per-thread
values.

The change to read_format::value does not affect existing valid
perf_event_attr configurations, nor does it change the behaviour of
calls to "read" on an event descriptor. Both continue to report the
aggregate value for the entire thread/process hierarchy. The difference
between the results reported by "read" and PERF_RECORD_SAMPLE in this
new configuration is justified on the basis that it is not (easily)
possible for "read" to target a specific thread (the caller only has
the fd for the original parent event).

Signed-off-by: Ben Gainey <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
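
A userspace sketch of the newly-valid configuration (event choice and
period are arbitrary):

  #include <linux/perf_event.h>
  #include <string.h>

  static void setup_inherited_sampling(struct perf_event_attr *attr)
  {
      memset(attr, 0, sizeof(*attr));
      attr->size          = sizeof(*attr);
      attr->type          = PERF_TYPE_HARDWARE;
      attr->config        = PERF_COUNT_HW_CPU_CYCLES;
      attr->sample_period = 1000000;
      /* PERF_SAMPLE_TID is mandatory with inherit + PERF_SAMPLE_READ. */
      attr->sample_type   = PERF_SAMPLE_TID | PERF_SAMPLE_READ;
      attr->inherit       = 1;  /* samples report per-thread values */
  }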

# 79bd2330 30-Jul-2024 Ben Gainey <[email protected]>

perf: Rename perf_event_context.nr_pending to nr_no_switch_fast.

nr_pending counts the number of events in the context that have
either pending_sigtrap or pending_work set, but it is used
to prevent taking the fast path in perf_event_context_sched_out.

Rename it to reflect what it is used for, rather than what it
counts. This change allows using the field to track other
event properties that also require skipping the fast path
without possible confusion over the name.

Signed-off-by: Ben Gainey <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
