.. SPDX-License-Identifier: GPL-2.0

================
Perf ring buffer
================

.. CONTENTS

    1. Introduction

    2. Ring buffer implementation
    2.1  Basic algorithm
    2.2  Ring buffer for different tracing modes
    2.2.1       Default mode
    2.2.2       Per-thread mode
    2.2.3       Per-CPU mode
    2.2.4       System wide mode
    2.3  Accessing buffer
    2.3.1       Producer-consumer model
    2.3.2       Properties of the ring buffers
    2.3.3       Writing samples into buffer
    2.3.4       Reading samples from buffer
    2.3.5       Memory synchronization

    3. The mechanism of AUX ring buffer
    3.1  The relationship between AUX and regular ring buffers
    3.2  AUX events
    3.3  Snapshot mode


1. Introduction
===============

The ring buffer is a fundamental mechanism for data transfer.  perf uses
ring buffers to transfer event data from kernel to user space; another
kind of ring buffer, the so-called auxiliary (AUX) ring buffer, also
plays an important role for hardware tracing with Intel PT, Arm
CoreSight, etc.

The ring buffer implementation is critical, but it is also challenging
work.  On the one hand, the kernel and the perf tool in user space use
the ring buffer to exchange data and store it into a data file, so the
ring buffer needs to transfer data with high throughput; on the other
hand, the ring buffer management should avoid significant overhead that
would distort the profiling results.

This documentation dives into the details of the perf ring buffer in two
parts: the first part explains the perf ring buffer implementation, and
the second part discusses the AUX ring buffer mechanism.

2. Ring buffer implementation
=============================

2.1 Basic algorithm
-------------------

A typical ring buffer is managed by a head pointer and a tail pointer;
the head pointer is manipulated by a writer and the tail pointer is
updated by a reader.

::

        +---------------------------+
        |   |   |***|***|***|   |   |
        +---------------------------+
                `-> Tail    `-> Head

        * : the data is filled by the writer.

                Figure 1. Ring buffer

Perf manages its ring buffer in the same way.  In the implementation
there are two key data structures held together in a set of consecutive
pages: the control structure, followed by the ring buffer itself.  The
page containing the control structure is known as the "user page".
Holding both in contiguous virtual addresses simplifies locating the
ring buffer: it starts in the page immediately after the user page.

The control structure is named ``perf_event_mmap_page``; it contains a
head pointer ``data_head`` and a tail pointer ``data_tail``.  When the
kernel starts to fill records into the ring buffer, it updates the head
pointer to reserve the memory so that it can later safely store events
into the buffer.  On the other side, when the user page is a writable
mapping, the perf tool has permission to update the tail pointer after
consuming data from the ring buffer.  The case of a read-only mapping of
the user page is addressed in the section
:ref:`writing_samples_into_buffer`.

::

          user page                          ring buffer
    +---------+---------+   +---------------------------------------+
    |data_head|data_tail|...|   |   |***|***|***|***|***|   |   |   |
    +---------+---------+   +---------------------------------------+
        `          `----------------^                   ^
         `----------------------------------------------|

              * : the data is filled by the writer.

                Figure 2. Perf ring buffer

When using the ``perf record`` tool, we can specify the ring buffer size
with the option ``-m`` or ``--mmap-pages=``; the given size is rounded
up to a power of two that is a multiple of the page size.  Though the
kernel allocates all memory pages at once, mapping the pages into the
VMA is deferred until the perf tool accesses the buffer from user space.
In other words, the first time the perf tool accesses a page of the
buffer from user space, a page fault exception is taken and the kernel
uses this occasion to map the page into the process VMA (see
``perf_mmap_fault()``), so the perf tool can continue to access the page
after returning from the exception.
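
The sizing and layout rules above can be sketched in a few lines of C.
This is only an illustration, not code from the perf tool: the helper
names ``round_up_pow2()``, ``ring_mmap_len()`` and ``ring_data_offset()``
are hypothetical, and the sketch assumes that the mapping consists of
exactly one user page followed by a power-of-two number of data pages.

.. code-block:: c

    #include <stddef.h>

    /* Round n up to the next power of two (n >= 1).  Hypothetical helper. */
    static size_t round_up_pow2(size_t n)
    {
        size_t p = 1;

        while (p < n)
            p <<= 1;
        return p;
    }

    /*
     * Length of the whole mapping: one user page holding the control
     * structure, followed by the data pages of the ring buffer.
     */
    static size_t ring_mmap_len(size_t requested_pages, size_t page_size)
    {
        return (1 + round_up_pow2(requested_pages)) * page_size;
    }

    /* Byte offset of the ring buffer data from the start of the mapping. */
    static size_t ring_data_offset(size_t page_size)
    {
        return page_size;
    }

For example, a request for 100 pages with a 4 KiB page size yields 128
data pages plus the user page, i.e. a mapping of 129 pages.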

2.2 Ring buffer for different tracing modes
-------------------------------------------

The perf tool profiles programs in different modes: default mode,
per-thread mode, per-CPU mode, and system wide mode.  This section
describes these modes and how the ring buffer meets their requirements.
Finally, we will review the race conditions caused by these modes.

2.2.1 Default mode
^^^^^^^^^^^^^^^^^^

Usually we execute the ``perf record`` command followed by the name of
the program to profile, as in the command below::

        perf record test_program

This command doesn't specify any options for CPU or thread modes, so the
perf tool applies the default mode to the perf event.  It maps all the
CPUs in the system and the profiled program's PID on the perf event, and
it enables inheritance mode on the event so that child tasks inherit
the events.  As a result, the perf event is attributed as::

    evsel::cpus::map[]    = { 0 .. _SC_NPROCESSORS_ONLN-1 }
    evsel::threads::map[] = { pid }
    evsel::attr::inherit  = 1

These attributes are finally reflected in the deployment of ring
buffers.  As shown below, the perf tool allocates an individual ring
buffer for each CPU, but it only enables events for the profiled program
rather than for all threads in the system.  The *T1* thread represents
the thread context of the 'test_program', whereas *T2* and *T3* are
unrelated threads in the system.  The perf samples are exclusively
collected for the *T1* thread and stored in the ring buffer associated
with the CPU on which the *T1* thread is running.

::

              T1                      T2                 T1
            +----+              +-----------+          +----+
    CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
            +----+--------------+-----------+----------+----+-------->
              |                                          |
              v                                          v
            +-----------------------------------------------------+
            |                  Ring buffer 0                      |
            +-----------------------------------------------------+

                   T1
                 +-----+
    CPU1         |xxxxx|
            -----+-----+--------------------------------------------->
                    |
                    v
            +-----------------------------------------------------+
            |                  Ring buffer 1                      |
            +-----------------------------------------------------+

                                        T1              T3
                                      +----+        +-------+
    CPU2                              |xxxx|        |xxxxxxx|
            --------------------------+----+--------+-------+-------->
                                        |
                                        v
            +-----------------------------------------------------+
            |                  Ring buffer 2                      |
            +-----------------------------------------------------+

                              T1
                       +--------------+
    CPU3               |xxxxxxxxxxxxxx|
            -----------+--------------+------------------------------>
                              |
                              v
            +-----------------------------------------------------+
            |                  Ring buffer 3                      |
            +-----------------------------------------------------+

            T1: Thread 1; T2: Thread 2; T3: Thread 3
            x: Thread is in running state

                Figure 3. Ring buffer for default mode

2.2.2 Per-thread mode
^^^^^^^^^^^^^^^^^^^^^

Per-thread mode is enabled by specifying the option ``--per-thread`` in
the perf command, for example::

        perf record --per-thread test_program

In this mode the perf event doesn't map to any CPUs and is only bound to
the profiled process; thus, the perf event's attributes are::

    evsel::cpus::map[0]   = { -1 }
    evsel::threads::map[] = { pid }
    evsel::attr::inherit  = 0

In this mode, a single ring buffer is allocated for the profiled thread;
if the thread is scheduled on a CPU, the events on that CPU are enabled;
and if the thread is scheduled out from the CPU, the events on the CPU
are disabled.  When the thread is migrated from one CPU to another, the
events are disabled on the previous CPU and enabled on the next CPU
correspondingly.

::

              T1                      T2                 T1
            +----+              +-----------+          +----+
    CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
            +----+--------------+-----------+----------+----+-------->
              |                                           |
              |    T1                                     |
              |  +-----+                                  |
    CPU1      |  |xxxxx|                                  |
            --|--+-----+----------------------------------|---------->
              |     |                                     |
              |     |                   T1            T3  |
              |     |                 +----+        +---+ |
    CPU2      |     |                 |xxxx|        |xxx| |
            --|-----|-----------------+----+--------+---+-|---------->
              |     |                   |                 |
              |     |         T1        |                 |
              |     |  +--------------+ |                 |
    CPU3      |     |  |xxxxxxxxxxxxxx| |                 |
            --|-----|--+--------------+-|-----------------|---------->
              |     |         |         |                 |
              v     v         v         v                 v
            +-----------------------------------------------------+
            |                  Ring buffer                        |
            +-----------------------------------------------------+

            T1: Thread 1
            x: Thread is in running state

                Figure 4. Ring buffer for per-thread mode

When perf runs in per-thread mode, a ring buffer is allocated for the
profiled thread *T1*.  The ring buffer is dedicated to thread *T1*:
while the thread *T1* is running, the perf events are recorded into the
ring buffer; when the thread is sleeping, all associated events are
disabled, thus no trace data is recorded into the ring buffer.

2.2.3 Per-CPU mode
^^^^^^^^^^^^^^^^^^

The option ``-C`` is used to collect samples on a list of CPUs; for
example, the perf command below receives the option ``-C 0,2``::

        perf record -C 0,2 test_program

It maps the perf event to CPUs 0 and 2, and the event is not associated
with any PID.  Thus the perf event attributes are set as::

    evsel::cpus::map[]    = { 0, 2 }
    evsel::threads::map[] = { -1 }
    evsel::attr::inherit  = 0

As a result, the ``perf record`` session samples all threads on CPU0 and
CPU2, and terminates when test_program exits.  Even if there are tasks
running on CPU1 and CPU3, since no ring buffer exists for them, any
activity on these two CPUs is ignored.  Another use case is to combine
the options for per-thread mode and per-CPU mode, e.g. when the options
``-C 0,2`` and ``--per-thread`` are specified together, the samples are
recorded only when the profiled thread is scheduled on any of the listed
CPUs.

::

              T1                      T2                 T1
            +----+              +-----------+          +----+
    CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
            +----+--------------+-----------+----------+----+-------->
              |                       |                  |
              v                       v                  v
            +-----------------------------------------------------+
            |                  Ring buffer 0                      |
            +-----------------------------------------------------+

                   T1
                 +-----+
    CPU1         |xxxxx|
            -----+-----+--------------------------------------------->

                                        T1              T3
                                      +----+        +-------+
    CPU2                              |xxxx|        |xxxxxxx|
            --------------------------+----+--------+-------+-------->
                                        |               |
                                        v               v
            +-----------------------------------------------------+
            |                  Ring buffer 1                      |
            +-----------------------------------------------------+

                              T1
                       +--------------+
    CPU3               |xxxxxxxxxxxxxx|
            -----------+--------------+------------------------------>

            T1: Thread 1; T2: Thread 2; T3: Thread 3
            x: Thread is in running state

                Figure 5. Ring buffer for per-CPU mode

2.2.4 System wide mode
^^^^^^^^^^^^^^^^^^^^^^

By using the option ``-a`` or ``--all-cpus``, perf collects samples on
all CPUs for all tasks; we call this the system wide mode.  The command
is::

        perf record -a test_program

Similar to the per-CPU mode, the perf event doesn't bind to any PID, and
it maps to all CPUs in the system::

    evsel::cpus::map[]    = { 0 .. _SC_NPROCESSORS_ONLN-1 }
    evsel::threads::map[] = { -1 }
    evsel::attr::inherit  = 0

In the system wide mode, every CPU has its own ring buffer; all threads
are monitored while in the running state and the samples are recorded
into the ring buffer belonging to the CPU on which the events occurred.

::

              T1                      T2                 T1
            +----+              +-----------+          +----+
    CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
            +----+--------------+-----------+----------+----+-------->
              |                       |                  |
              v                       v                  v
            +-----------------------------------------------------+
            |                  Ring buffer 0                      |
            +-----------------------------------------------------+

                   T1
                 +-----+
    CPU1         |xxxxx|
            -----+-----+--------------------------------------------->
                    |
                    v
            +-----------------------------------------------------+
            |                  Ring buffer 1                      |
            +-----------------------------------------------------+

                                        T1              T3
                                      +----+        +-------+
    CPU2                              |xxxx|        |xxxxxxx|
            --------------------------+----+--------+-------+-------->
                                        |               |
                                        v               v
            +-----------------------------------------------------+
            |                  Ring buffer 2                      |
            +-----------------------------------------------------+

                              T1
                       +--------------+
    CPU3               |xxxxxxxxxxxxxx|
            -----------+--------------+------------------------------>
                              |
                              v
            +-----------------------------------------------------+
            |                  Ring buffer 3                      |
            +-----------------------------------------------------+

            T1: Thread 1; T2: Thread 2; T3: Thread 3
            x: Thread is in running state

                Figure 6. Ring buffer for system wide mode
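
One way to read the attribute listings of the four modes is as the
``(pid, cpu)`` argument pairs that end up being passed to the
``perf_event_open()`` system call, one pair per opened event.  The
sketch below only builds such a list; ``struct open_arg`` and
``build_open_args()`` are hypothetical names introduced for
illustration and are not part of the perf tool.

.. code-block:: c

    #include <stddef.h>

    struct open_arg {
        int pid;    /* pid argument of perf_event_open(); -1: any task */
        int cpu;    /* cpu argument of perf_event_open(); -1: any CPU  */
    };

    /*
     * Fill 'out' with one (pid, cpu) pair per event to open and return
     * the number of pairs.  'cpus' lists the CPUs to map; pass
     * ncpus == 0 for per-thread mode.  'pid' is the profiled task, or
     * -1 to monitor all tasks.
     */
    static size_t build_open_args(int pid, const int *cpus, size_t ncpus,
                                  struct open_arg *out)
    {
        size_t i;

        if (ncpus == 0) {
            /* Per-thread mode: a single event that follows the task. */
            out[0].pid = pid;
            out[0].cpu = -1;
            return 1;
        }
        for (i = 0; i < ncpus; i++) {
            out[i].pid = pid;
            out[i].cpu = cpus[i];
        }
        return ncpus;
    }

Under this reading, the default mode passes the profiled PID with every
online CPU, the per-CPU mode passes ``-1`` as the PID with the CPUs
given by ``-C``, and the system wide mode passes ``-1`` as the PID with
every online CPU.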

2.3 Accessing buffer
--------------------

Based on the understanding of how the ring buffer is allocated in
various modes, this section explains how the ring buffer is accessed.

2.3.1 Producer-consumer model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the Linux kernel, the PMU events can produce samples which are stored
into the ring buffer; the perf command in user space consumes the
samples by reading out data from the ring buffer and finally saves the
data into a file for post analysis.  This is a typical producer-consumer
model for using the ring buffer.

The perf process polls on the PMU events and sleeps when no events are
incoming.  To prevent frequent exchanges between the kernel and user
space, the kernel event core layer introduces a watermark, which is
stored in ``perf_buffer::watermark``.  When a sample is recorded into
the ring buffer, and if the used buffer exceeds the watermark, the
kernel wakes up the perf process to read samples from the ring buffer.
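
The watermark test described above can be pictured as the comparison
below.  This is a simplified sketch that follows the description in
this section; the kernel's real check may be structured differently,
and ``should_wake_reader()`` is a hypothetical name.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Return true when the amount of unread data has passed the
     * watermark.  'head' and 'tail' are free-running byte counters, so
     * the unsigned subtraction stays correct after they wrap.
     */
    static bool should_wake_reader(uint64_t head, uint64_t tail,
                                   uint64_t watermark)
    {
        return head - tail > watermark;
    }

With a watermark of 4096 bytes, 100 unread bytes leave the reader
asleep, while 5000 unread bytes trigger a wake-up.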

::

                       Perf
                       / | Read samples
             Polling  /  `--------------|               Ring buffer
                     v                  v    ;---------------------v
    +----------------+     +---------+---------+   +-------------------+
    |Event wait queue|     |data_head|data_tail|   |***|***|   |   |***|
    +----------------+     +---------+---------+   +-------------------+
             ^                  ^ `------------------------^
             | Wake up tasks    | Store samples
          +-----------------------------+
          |  Kernel event core layer    |
          +-----------------------------+

              * : the data is filled by the writer.

                Figure 7. Writing and reading the ring buffer

When the kernel event core layer notifies user space, because multiple
events might share the same ring buffer for recording samples, the core
layer iterates over every event associated with the ring buffer and
wakes up the tasks waiting on the event.  This is done by the kernel
function ``ring_buffer_wakeup()``.

After the perf process is woken up, it checks the ring buffers one by
one; if it finds any ring buffer containing samples, it reads out the
samples for statistics or for saving into the data file.  Given that the
perf process is able to run on any CPU, the ring buffer can potentially
be accessed from multiple CPUs simultaneously, which causes race
conditions.  The race condition handling is described in the section
:ref:`memory_synchronization`.

2.3.2 Properties of the ring buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Linux kernel supports two write directions for the ring buffer:
forward and backward.  Forward writing saves samples from the beginning
of the ring buffer; backward writing stores data from the end of the
ring buffer in the reverse direction.  The perf tool determines the
writing direction.

Additionally, the tool can map buffers into user space in either
read-write mode or read-only mode.

The ring buffer in the read-write mode is mapped with the property
``PROT_READ | PROT_WRITE``.  With the write permission, the perf tool
updates ``data_tail`` to indicate the data start position.  Combined
with the head pointer ``data_head``, which works as the end position of
the current data, the perf tool can easily know where to read the data
from.

Alternatively, in the read-only mode, only the kernel keeps updating
``data_head`` while user space cannot update ``data_tail`` due to the
mapping property ``PROT_READ``.

As a result, the matrix below illustrates the various combinations of
direction and mapping characteristics.  The perf tool employs two of
these combinations to support two buffer types: the non-overwrite buffer
and the overwritable buffer.

.. list-table::
   :widths: 1 1 1
   :header-rows: 1

   * - Mapping mode
     - Forward
     - Backward
   * - read-write
     - Non-overwrite ring buffer
     - Not used
   * - read-only
     - Not used
     - Overwritable ring buffer

The non-overwrite ring buffer uses the read-write mapping with forward
writing.  It starts to save data from the beginning of the ring buffer
and wraps around on overflow; this is the normal ring buffer.  When the
consumer doesn't keep up with the producer, some data is lost; the
kernel keeps count of how many records it lost and generates a
``PERF_RECORD_LOST`` record the next time it finds space in the ring
buffer.
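
The behaviour of the non-overwrite buffer can be modelled with a toy
writer.  This is a user space sketch under simplifying assumptions, not
kernel code: ``struct toy_rb`` and ``toy_rb_write()`` are hypothetical,
the buffer is only 16 bytes, and the sketch merely counts dropped
records, whereas a real writer would also emit a ``PERF_RECORD_LOST``
record carrying that count once space is available again.

.. code-block:: c

    #include <stdint.h>

    #define RB_SIZE 16u             /* toy size in bytes, power of two */

    struct toy_rb {
        uint64_t head;              /* advanced by the writer */
        uint64_t tail;              /* advanced by the reader */
        uint64_t lost;              /* number of dropped records */
        unsigned char data[RB_SIZE];
    };

    /*
     * Forward, non-overwrite write.  Returns 0 on success, or -1 if the
     * record does not fit into the free space, in which case the record
     * is dropped and counted as lost.
     */
    static int toy_rb_write(struct toy_rb *rb, const void *rec, uint32_t len)
    {
        const unsigned char *src = rec;
        uint32_t i;

        if (len > RB_SIZE - (rb->head - rb->tail)) {
            rb->lost++;
            return -1;
        }
        /* Copy byte by byte so that the write wraps at the buffer end. */
        for (i = 0; i < len; i++)
            rb->data[(rb->head + i) & (RB_SIZE - 1)] = src[i];
        rb->head += len;
        return 0;
    }

Writing 10 bytes and then 8 more without the reader advancing ``tail``
drops the second record; once ``tail`` catches up, the same write
succeeds and wraps around the end of the buffer.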

The overwritable ring buffer uses backward writing with the read-only
mode.  It saves data from the end of the ring buffer and ``data_head``
keeps the position of the current data; the perf tool always knows that
it starts reading there and continues until the end of the ring buffer,
thus it doesn't need ``data_tail``.  In this mode, the kernel does not
generate ``PERF_RECORD_LOST`` records.

.. _writing_samples_into_buffer:

2.3.3 Writing samples into buffer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When a sample is taken and saved into the ring buffer, the kernel
prepares the sample fields based on the sample type; then it prepares
the information for writing to the ring buffer, which is stored in the
structure ``perf_output_handle``.  In the end, the kernel outputs the
sample into the ring buffer and updates the head pointer in the user
page so that the perf tool can see the latest value.

The structure ``perf_output_handle`` serves as a temporary context for
tracking the information related to the buffer.  Its advantage is that
it enables concurrent writing to the buffer by different events.  For
example, when a software event and a hardware PMU event are both enabled
for profiling, two instances of ``perf_output_handle`` serve as separate
contexts for the software event and the hardware event respectively.
This allows each event to reserve its own memory space for populating
the record data.

2.3.4 Reading samples from buffer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In user space, the perf tool uses the ``perf_event_mmap_page`` structure
to handle the head and tail of the buffer.  It also uses the
``perf_mmap`` structure to keep track of a context for the ring buffer;
this context includes information about the buffer's starting and ending
addresses.  Additionally, the mask value can be used to compute the
circular buffer pointer even after an overflow.
52341397152SLeo Yan
52441397152SLeo YanSimilar to the kernel, the perf tool in the user space first reads out
52541397152SLeo Yanthe recorded data from the ring buffer, and then updates the buffer's
52641397152SLeo Yantail pointer ``perf_event_mmap_page::data_tail``.
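
The use of the mask can be sketched as follows.  ``toy_copy_out()`` is a
hypothetical helper written for this illustration, not a function from
the perf tool; it assumes that the buffer size is a power of two (so the
mask is the size minus one) and that ``len`` does not exceed the buffer
size.  A record that crosses the end of the buffer is copied in two
pieces.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Copy 'len' bytes that start at the free-running position 'pos' out
     * of a circular buffer whose size is mask + 1.
     */
    static void toy_copy_out(const unsigned char *buf, uint64_t mask,
                             uint64_t pos, void *dst, size_t len)
    {
        uint64_t size = mask + 1;
        uint64_t off = pos & mask;          /* position inside the buffer */
        size_t first = len;

        /* Limit the first piece to the bytes left before the buffer end. */
        if (first > size - off)
            first = (size_t)(size - off);

        memcpy(dst, buf + off, first);
        /* The remainder, if any, continues at the start of the buffer. */
        memcpy((unsigned char *)dst + first, buf, len - first);
    }

For an 8-byte buffer and a position of 14, the masked offset is 6, so a
4-byte read takes the last two bytes of the buffer followed by the first
two.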

.. _memory_synchronization:

2.3.5 Memory synchronization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Modern CPUs with a relaxed memory model do not guarantee memory
ordering, which means that accesses to the ring buffer and to the
``perf_event_mmap_page`` structure can happen out of order.  To enforce
a specific order of memory accesses to the perf ring buffer, memory
barriers are used.  The rationale for the memory synchronization is as
below::

  Kernel                          User space

  if (LOAD ->data_tail) {         LOAD ->data_head
                   (A)            smp_rmb()        (C)
    STORE $data                   LOAD $data
    smp_wmb()      (B)            smp_mb()         (D)
    STORE ->data_head             STORE ->data_tail
  }

The comments in tools/include/linux/ring_buffer.h give a nice description
of why and how to use memory barriers; here we will just provide an
alternative explanation:

(A) is a control dependency so that the CPU assures the order between
checking the pointer ``perf_event_mmap_page::data_tail`` and filling a
sample into the ring buffer;

(D) pairs with (A).  (D) separates reading the ring buffer data from
writing the pointer ``data_tail``: the perf tool first consumes samples
and then tells the kernel that the data chunk has been released.  Since a
read operation is followed by a write operation, (D) is a full memory
barrier.

(B) is a write barrier in the middle of two write operations, which
makes sure that recording a sample happens prior to updating the head
pointer.

(C) pairs with (B).  (C) is a read memory barrier to ensure the head
pointer is fetched before reading samples.

To implement the above algorithm, the ``perf_output_put_handle()`` function
in the kernel and two helpers, ``ring_buffer_read_head()`` and
``ring_buffer_write_tail()``, in the user space are introduced; they rely
on memory barriers as described above to ensure the required ordering.

Some architectures support one-way permeable barriers with load-acquire
and store-release operations.  These barriers are more relaxed and have a
lower performance penalty, so (C) and (D) can be optimized to use
``smp_load_acquire()`` and ``smp_store_release()`` respectively.

If an architecture doesn't support load-acquire and store-release in its
memory model, it falls back to the old-fashioned memory barrier
operations.  In this case, ``smp_load_acquire()`` encapsulates
``READ_ONCE()`` + ``smp_mb()``; since ``smp_mb()`` is costly,
``ring_buffer_read_head()`` doesn't invoke ``smp_load_acquire()`` and uses
``READ_ONCE()`` + ``smp_rmb()`` instead.
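
For readers more familiar with C11 atomics than with the kernel macros,
the consumer side can be approximated as below.  This is a sketch and not
the code of the helpers named above: ``toy_user_page`` is a hypothetical
stand-in for the head and tail fields of the user page, and the mapping
of ``smp_load_acquire()`` and ``smp_store_release()`` onto
``memory_order_acquire`` and ``memory_order_release`` is an approximation
made for illustration.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the head and tail fields of the user page. */
    struct toy_user_page {
        _Atomic uint64_t data_head;
        _Atomic uint64_t data_tail;
    };

    /* (C): acquire load, sample reads cannot be reordered before it. */
    static uint64_t toy_read_head(struct toy_user_page *up)
    {
        return atomic_load_explicit(&up->data_head, memory_order_acquire);
    }

    /* (D): release store, earlier sample reads complete before it. */
    static void toy_write_tail(struct toy_user_page *up, uint64_t tail)
    {
        atomic_store_explicit(&up->data_tail, tail, memory_order_release);
    }

    /* Consume everything between tail and head; returns the byte count. */
    static uint64_t toy_consume(struct toy_user_page *up)
    {
        uint64_t head = toy_read_head(up);
        /* Only the consumer writes the tail, so a relaxed load is enough. */
        uint64_t tail = atomic_load_explicit(&up->data_tail,
                                             memory_order_relaxed);

        /* ... the samples in [tail, head) would be read here ... */

        toy_write_tail(up, head);
        return head - tail;
    }

The acquire load plays the role of (C) and the release store the role of
(D); the samples must be read between the two calls.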

3. The mechanism of AUX ring buffer
===================================

In this chapter, we will explain the implementation of the AUX ring
buffer.  The first part discusses the connection between the AUX ring
buffer and the regular ring buffer; the second part examines how the
AUX ring buffer works together with the regular ring buffer, as well as
the additional features introduced by the AUX ring buffer for the
sampling mechanism.

3.1 The relationship between AUX and regular ring buffers
---------------------------------------------------------

Generally, the AUX ring buffer is an auxiliary for the regular ring
buffer.  The regular ring buffer is primarily used to store the event
samples and every event format complies with the definition in the
union ``perf_event``; the AUX ring buffer is for recording the hardware
trace data and the trace data format is hardware IP dependent.

The general use and advantage of the AUX ring buffer is that it is
written directly by hardware rather than by the kernel.  For example,
regular profile samples that write to the regular ring buffer cause an
interrupt.  Tracing execution requires a high number of samples and
using interrupts would be overwhelming for the regular ring buffer
mechanism.  Having an AUX buffer allows for a region of memory more
decoupled from the kernel and written to directly by hardware tracing.

The AUX ring buffer reuses the same algorithm as the regular ring
buffer for buffer management.  The control structure
``perf_event_mmap_page`` is extended with the new fields ``aux_head`` and
``aux_tail`` for the head and tail pointers of the AUX ring buffer.

During the initialisation phase, besides the mmap()-ed regular ring
buffer, the perf tool invokes a second syscall in the
``auxtrace_mmap__mmap()`` function to mmap the AUX buffer with a
non-zero file offset; ``rb_alloc_aux()`` in the kernel allocates pages
correspondingly.  Mapping these pages into the VMA is deferred until the
page fault is handled, which is the same lazy mechanism as for the
regular ring buffer.
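
The second mapping can be sketched in a few lines of user-space C.  The
helper ``toy_map_aux()`` and its parameters are hypothetical; the sketch
assumes that the AUX area is placed directly behind the user page and the
regular data area, that ``aux_bytes`` is a power-of-two number of pages,
and that ``fd`` comes from ``perf_event_open()`` with the regular buffer
already mapped at ``up``.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <linux/perf_event.h>

    /*
     * Announce the AUX area in the user page, then map it with the same
     * non-zero file offset.  Returns MAP_FAILED on error, like mmap().
     */
    static void *toy_map_aux(int fd, struct perf_event_mmap_page *up,
                             size_t page_size, size_t data_bytes,
                             size_t aux_bytes)
    {
        /* Assumption: AUX area follows the user page and the data area. */
        up->aux_offset = page_size + data_bytes;
        up->aux_size = aux_bytes;

        return mmap(NULL, aux_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                    fd, (off_t)up->aux_offset);
    }

The offset passed to ``mmap()`` and the values written to ``aux_offset``
and ``aux_size`` have to agree, which is why the sketch fills in the user
page first.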

AUX events and AUX trace data are two different things.  Let's see an
example::

        perf record -a -e cycles -e cs_etm// -- sleep 2

The above command enables two events: one is the event *cycles* from the
PMU and the other is the AUX event *cs_etm* from Arm CoreSight.  Both
are saved into the regular ring buffer, while CoreSight's AUX trace data
is stored in the AUX ring buffer.

As a result, we can see the regular ring buffer and the AUX ring buffer
are allocated in pairs.  In the default mode, perf allocates the regular
ring buffer and the AUX ring buffer per CPU, which is the same as in
the system wide mode; however, the default mode records samples only for
the profiled program, whereas the latter mode profiles all programs in
the system.  For the per-thread mode, the perf tool allocates only one
regular ring buffer and one AUX ring buffer for the whole session.  For
the per-CPU mode, perf allocates the two kinds of ring buffers for the
CPUs selected by the option ``-C``.

The figure below demonstrates the buffers' layout in the system wide
mode; if there is any activity on a CPU, the AUX event samples and
the hardware trace data will be recorded into the dedicated buffers for
that CPU.

::

              T1                      T2                 T1
            +----+              +-----------+          +----+
    CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
            +----+--------------+-----------+----------+----+-------->
              |                       |                  |
              v                       v                  v
            +-----------------------------------------------------+
            |                  Ring buffer 0                      |
            +-----------------------------------------------------+
              |                       |                  |
              v                       v                  v
            +-----------------------------------------------------+
            |               AUX Ring buffer 0                     |
            +-----------------------------------------------------+

                   T1
                 +-----+
    CPU1         |xxxxx|
            -----+-----+--------------------------------------------->
                    |
                    v
            +-----------------------------------------------------+
            |                  Ring buffer 1                      |
            +-----------------------------------------------------+
                    |
                    v
            +-----------------------------------------------------+
            |               AUX Ring buffer 1                     |
            +-----------------------------------------------------+

                                        T1              T3
                                      +----+        +-------+
    CPU2                              |xxxx|        |xxxxxxx|
            --------------------------+----+--------+-------+-------->
                                        |               |
                                        v               v
            +-----------------------------------------------------+
            |                  Ring buffer 2                      |
            +-----------------------------------------------------+
                                        |               |
                                        v               v
            +-----------------------------------------------------+
            |               AUX Ring buffer 2                     |
            +-----------------------------------------------------+

                              T1
                       +--------------+
    CPU3               |xxxxxxxxxxxxxx|
            -----------+--------------+------------------------------>
                              |
                              v
            +-----------------------------------------------------+
            |                  Ring buffer 3                      |
            +-----------------------------------------------------+
                              |
                              v
            +-----------------------------------------------------+
            |               AUX Ring buffer 3                     |
            +-----------------------------------------------------+

            T1: Thread 1; T2: Thread 2; T3: Thread 3
            x: Thread is in running state

                Figure 8. AUX ring buffer for system wide mode

3.2 AUX events
--------------

Similar to how ``perf_output_begin()`` and ``perf_output_end()`` work for
the regular ring buffer, ``perf_aux_output_begin()`` and
``perf_aux_output_end()`` serve the AUX ring buffer for processing the
hardware trace data.

Once the hardware trace data is stored into the AUX ring buffer, the PMU
driver will stop hardware tracing by calling the ``pmu::stop()`` callback.
Similar to the regular ring buffer, the AUX ring buffer needs to apply
the memory synchronization mechanism as discussed in the section
:ref:`memory_synchronization`.  Since the AUX ring buffer is managed by the
PMU driver, the barrier (B), which is a write barrier to ensure the trace
data is externally visible prior to updating the head pointer, must be
implemented in the PMU driver.

Then ``pmu::stop()`` can safely call the ``perf_aux_output_end()`` function to
finish two things:

- It fills a ``PERF_RECORD_AUX`` event into the regular ring buffer; this
  event delivers the start address and data size of the chunk of hardware
  trace data that has been stored into the AUX ring buffer;

- Since the hardware trace driver has stored new trace data into the AUX
  ring buffer, the argument *size* indicates how many bytes have been
  consumed by the hardware tracing, thus ``perf_aux_output_end()`` updates the
  head pointer ``perf_buffer::aux_head`` to reflect the latest buffer usage.

At the end, the PMU driver will restart hardware tracing.  During this
temporary suspension, hardware trace data is lost, which introduces a
discontinuity in the decoding phase.

The event ``PERF_RECORD_AUX`` represents an AUX event which is handled in
the kernel, but it lacks the information for saving the AUX trace data in
the perf file.  When the perf tool copies the trace data from the AUX ring
buffer to the perf data file, it synthesizes a ``PERF_RECORD_AUXTRACE``
event, which is not a kernel ABI; it is defined by the perf tool to
describe which portion of data in the AUX ring buffer is saved.
Afterwards, the perf tool reads out the AUX trace data from the perf file
based on the ``PERF_RECORD_AUXTRACE`` events, and the ``PERF_RECORD_AUX``
event is used to decode a chunk of data by correlating with time order.
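
To make the content of a ``PERF_RECORD_AUX`` event more concrete, the
sketch below mirrors the leading fields of the record (header, offset,
size and flags) in local ``toy_`` structures; real code would use the
definitions from ``<linux/perf_event.h>`` instead.  The helpers are
hypothetical, and the sketch assumes that the recorded offset is a
free-running position and that the AUX buffer size is a power of two.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Local mirror of the generic record header, for illustration only. */
    struct toy_event_header {
        uint32_t type;
        uint16_t misc;
        uint16_t size;
    };

    /* Local mirror of the leading fields of a PERF_RECORD_AUX record. */
    struct toy_record_aux {
        struct toy_event_header header;
        uint64_t aux_offset;    /* where the new chunk starts */
        uint64_t aux_size;      /* how many bytes the chunk holds */
        uint64_t flags;         /* e.g. the truncated flag below */
    };

    #define TOY_RECORD_AUX          11      /* value of PERF_RECORD_AUX */
    #define TOY_AUX_FLAG_TRUNCATED  0x01    /* PERF_AUX_FLAG_TRUNCATED */

    /* Start of the chunk inside an AUX buffer of 'aux_len' bytes. */
    static uint64_t toy_aux_chunk_start(const struct toy_record_aux *r,
                                        uint64_t aux_len)
    {
        return r->aux_offset & (aux_len - 1);
    }

    /* True if the record reports that trace data had to be dropped. */
    static bool toy_aux_truncated(const struct toy_record_aux *r)
    {
        return (r->flags & TOY_AUX_FLAG_TRUNCATED) != 0;
    }

With a 16 KiB AUX buffer, a record whose offset is 0x5000 describes a
chunk that starts at byte 0x1000 of the buffer.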

3.3 Snapshot mode
-----------------

Perf supports snapshot mode for the AUX ring buffer.  In this mode, users
only record AUX trace data at the specific points in time which they are
interested in.  For example, the commands below take snapshots at
1 second intervals with Arm CoreSight::

  perf record -e cs_etm//u -S -a program &
  PERFPID=$!
  while true; do
      kill -USR2 $PERFPID
      sleep 1
  done

The main flow for snapshot mode is:

- Before a snapshot is taken, the AUX ring buffer acts in free run mode.
  During free run mode, perf doesn't record any of the AUX events or
  trace data;

- Once the perf tool receives the *USR2* signal, it triggers the callback
  function ``auxtrace_record::snapshot_start()`` to deactivate hardware
  tracing.  The kernel driver then populates the AUX ring buffer with the
  hardware trace data, and the event ``PERF_RECORD_AUX`` is stored in the
  regular ring buffer;

- Then the perf tool takes a snapshot: ``record__read_auxtrace_snapshot()``
  reads out the hardware trace data from the AUX ring buffer and saves it
  into the perf data file;

- After the snapshot is finished, ``auxtrace_record::snapshot_finish()``
  restarts the PMU event for AUX tracing.

The perf tool only accesses the head pointer
``perf_event_mmap_page::aux_head`` in snapshot mode and doesn't touch the
tail pointer ``aux_tail``; this is because the AUX ring buffer can
overflow in free run mode, so the tail pointer is useless in this case.
Instead, the callback ``auxtrace_record::find_snapshot()`` is introduced
to decide whether the AUX ring buffer has wrapped around or not; at the
end it fixes up the AUX buffer's head, which is used to calculate the
trace data size.
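
The fix-up can be pictured with the following sketch.  It is a simplified
model of what a ``find_snapshot()`` implementation might do, not the code
of any particular driver: ``toy_find_snapshot()`` is a hypothetical
function, and the ``wrapped`` flag is assumed to have been determined
beforehand, since how a wrap is detected is hardware specific.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Fix up the [*old, *head) window for a snapshot of an AUX buffer of
     * 'len' bytes.  If the buffer has never wrapped, the window is left
     * as is.  Once it has wrapped, the window is widened to cover the
     * whole buffer, ending at the current write position.
     */
    static void toy_find_snapshot(uint64_t len, bool wrapped,
                                  uint64_t *head, uint64_t *old)
    {
        if (!wrapped)
            return;

        /* Keep *head >= len so that the subtraction cannot underflow. */
        if (*head < len)
            *head += len;

        *old = *head - len;
    }

The trace data size is then ``*head - *old``: only the bytes written so
far when the buffer has not wrapped, and the full buffer length
afterwards.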

As we know, the buffers can be deployed in per-thread mode, per-CPU
mode, or system wide mode, and the snapshot can be applied to any of
these modes.  Below is an example of taking a snapshot in system wide
mode.

::

                                         Snapshot is taken
                                                 |
                                                 v
                        +------------------------+
                        |  AUX Ring buffer 0     | <- aux_head
                        +------------------------+
                                                 v
                +--------------------------------+
                |          AUX Ring buffer 1     | <- aux_head
                +--------------------------------+
                                                 v
    +--------------------------------------------+
    |                      AUX Ring buffer 2     | <- aux_head
    +--------------------------------------------+
                                                 v
         +---------------------------------------+
         |                 AUX Ring buffer 3     | <- aux_head
         +---------------------------------------+

                Figure 9. Snapshot with system wide mode