llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code in a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with a backend for which
there is a scheduling model available in LLVM.
The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

(:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
directive at the beginning of the input.  By default its output syntax matches
that of its input.)

Scheduling models are not just used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a processor,
please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input.  If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.

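For example, assuming a hypothetical input file named ``foo.s``, the following
two invocations are equivalent ways to write the report to ``foo.report``:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 foo.s -o foo.report
  $ cat foo.s | llvm-mca -mcpu=btver2 -o foo.report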

.. option:: -help

 Print a summary of command line options.

.. option:: -o <filename>

 Use ``<filename>`` as the output filename. See the summary above for more
 details.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

  Specify the processor for which to analyze the code.  By default, the cpu name
  is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format, while a value of 1 selects the Intel assembly format for the code
 printed out by the tool in the analysis report.
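
 For example, one might print the analyzed code in Intel syntax in the report:

 .. code-block:: bash

   $ llvm-mca -mcpu=btver2 -output-asm-variant=1 foo.s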

.. option:: -print-imm-hex

 Prefer hex format for numeric literals in the output assembly printed as part
 of the report.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to field 'IssueWidth' in the processor scheduling model.  If width is
 zero, then the default dispatch width is used.
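
 For example, one could experiment with a hypothetical wider dispatch width
 like so:

 .. code-block:: bash

   $ llvm-mca -mcpu=btver2 -dispatch=4 foo.s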

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

  If set, the tool assumes that loads and stores don't alias. This is the
  default behavior.

.. option:: -lqueue=<load queue size>

  Specify the size of the load queue in the load/store unit emulated by the tool.
  By default, the tool assumes an unbounded number of entries in the load queue.
  A value of zero for this flag is ignored, and the default load queue size is
  used instead.

.. option:: -squeue=<store queue size>

  Specify the size of the store queue in the load/store unit emulated by the
  tool. By default, the tool assumes an unbounded number of entries in the store
  queue. A value of zero for this flag is ignored, and the default store queue
  size is used instead.
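
  For example, to model a hypothetical machine with a 16-entry load queue and
  an 8-entry store queue:

  .. code-block:: bash

    $ llvm-mca -mcpu=btver2 -lqueue=16 -squeue=8 foo.s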

.. option:: -timeline

  Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

  Limit the number of iterations to print in the timeline view. By default, the
  timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

  Limit the number of cycles in the timeline view, or use 0 for no limit. By
  default, the number of cycles is set to 80.

.. option:: -resource-pressure

  Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

  Enable register file usage statistics.

.. option:: -dispatch-stats

  Enable extra dispatch statistics. This view collects and analyzes instruction
  dispatch events, as well as static/dynamic dispatch stall events. This view
  is disabled by default.

.. option:: -scheduler-stats

  Enable extra scheduler statistics. This view collects and analyzes instruction
  issue events. This view is disabled by default.

.. option:: -retire-stats

  Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

  Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

  Enable the printing of instruction encodings within the instruction info view.

.. option:: -show-barriers

  Enable the printing of LoadBarrier and StoreBarrier flags within the
  instruction info view.

.. option:: -all-stats

  Print all hardware statistics. This enables extra statistics related to the
  dispatch logic, the hardware schedulers, the register file(s), and the retire
  control unit. This option is disabled by default.

.. option:: -all-views

  Enable all the views.

.. option:: -instruction-tables

  Prints resource pressure information based on the static information
  available from the processor model. This differs from the resource pressure
  view because it doesn't require that the code is simulated. It instead prints
  the theoretical uniform distribution of resource pressure for every
  instruction in sequence.
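
  For example, to print the instruction tables for the hypothetical ``foo.s``:

  .. code-block:: bash

    $ llvm-mca -mcpu=btver2 -instruction-tables foo.s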

.. option:: -bottleneck-analysis

  Print information about bottlenecks that affect the throughput. This analysis
  can be expensive, and it is disabled by default. Bottlenecks are highlighted
  in the summary view. Bottleneck analysis is currently not supported for
  processors with an in-order backend.

.. option:: -json

  Print the requested views in valid JSON format. The instructions and the
  processor resources are printed as members of special top level JSON objects.
  The individual views refer to them by index. However, not all views are
  currently supported. For example, the report from the bottleneck analysis is
  not printed out in JSON. All the default views are currently supported.
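
  For example, to write the default views as JSON to a file:

  .. code-block:: bash

    $ llvm-mca -mcpu=btver2 -json foo.s -o foo.json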

.. option:: -disable-cb

  Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
  than using the target specific implementation. The generic classes never
  detect any custom hazards or make any post processing modifications to
  instructions.


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------
:program:`llvm-mca` allows for the optional usage of special code comments to
mark regions of the assembly code to be analyzed.  A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region.  For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file.  Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every code region.

Code regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
    add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active user
defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
    a += 42;
    __asm volatile("# LLVM-MCA-END":::"memory");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.
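
One possible way to verify this (a sketch, assuming the markers in a
hypothetical ``foo.c`` are guarded by a hypothetical ``MCA_MARKERS`` macro) is
to compile the file twice, strip the marker comments, and diff the two assembly
outputs:

.. code-block:: bash

  $ clang foo.c -O2 -S -o - -DMCA_MARKERS | grep -v LLVM-MCA > with-markers.s
  $ clang foo.c -O2 -S -o - > without-markers.s
  $ diff with-markers.s without-markers.s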

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2.  The report can be produced with the
following command, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps	%xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps	%xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps	%xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps	%xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections.  The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle. For processors with an
in-order backend, *DispatchWidth* is the maximum number of micro opcodes issued
to the backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop carried dependencies. Block throughput is bounded from above by the
dispatch rate, and the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the `Block RThroughput`.

Field 'uOps Per Cycle' is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed 'uOps Per Cycle' should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the `Block RThroughput`.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.

The second section of the report is the `instruction info view`. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of the same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1
cycle/instruction.  That is because the FP multiplier JFPM is only available
from pipeline JFPU1.

Instruction encodings are displayed within the instruction info view when flag
`-show-encoding` is specified.

Below is an example of `-show-encoding` output for the dot-product kernel:

.. code-block:: none

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
   1      2     1.00                         4     c5 f0 59 d0                   vmulps	%xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da                   vhaddps	%xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3                   vhaddps	%xmm3, %xmm3, %xmm4

The `Encoding Size` column shows the size in bytes of instructions.  The
`Encodings` column shows the actual instruction encodings (byte sequences in
hex).

The third section is the *Resource pressure view*.  This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target.  Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration.  Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources.  Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided.  Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline.  This view is enabled by the
command line option ``-timeline``.  As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps	%xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps	%xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps	%xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps	%xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps	%xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps	%xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps	%xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps	%xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps	%xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4       <total>

The timeline view is interesting because it shows instruction state changes
during execution.  It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables.  The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence).  Since this
example was generated using 3 iterations: ``-iterations=3``, the iteration
indices range from 0-2 inclusively.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction.  So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency
chain.  Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations.  However, those dependencies can be
removed at register renaming stage (at the cost of allocating register aliases,
and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. The last row, ``<total>``, shows a global average over
all instructions measured. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue.  The difference between the two counters is a good indicator
of how large of an impact data dependencies had on the execution of the
instructions.  When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small.  However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

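A command along the following lines should reproduce it (reusing the
``dot-product.s`` input from the earlier examples):

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s
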
.. code-block:: none


  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]


According to the analysis, throughput is limited by resource pressure and not by
data dependencies.  The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The `critical sequence` is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.

Bottleneck analysis is currently not supported for processors with an in-order
backend.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

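The corresponding invocation would look something like this:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -all-stats dot-product.s
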
.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
logic is unable to dispatch a full group because the scheduler's queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
dispatch statistics are displayed by either using the command option
``-all-stats`` or ``-dispatch-stats``.

The next table, *Schedulers*, presents a histogram displaying a count,
representing the number of micro opcodes issued on some number of cycles. In
this case, of the 610 simulated cycles, single opcodes were issued 306 times
(50.2%) and there were 7 cycles where no opcodes were issued.

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds).  That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources.  Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies.  The scheduler
statistics are displayed by using the command option ``-all-stats`` or
``-scheduler-stats``.

The next table, *Retire Control Unit*, presents a histogram displaying a count,
representing the number of instructions retired on some number of cycles.  In
this case, of the 610 simulated cycles, two instructions were retired during the
same cycle 399 times (65.4%) and there were 109 cycles where no instructions
were retired.  The retire statistics are displayed by using the command option
``-all-stats`` or ``-retire-stats``.

The last table presented is *Register File statistics*.  Each physical register
file (PRF) used by the pipeline is presented in this table.  In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF).  The table shows that of the 900
instructions processed, there were 900 mappings created.  Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings.  However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example, and
that it was never resource constrained.  The register file statistics are
displayed by using the command option ``-all-stats`` or
``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions.

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).

:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
decode stages are not modeled. Performance bottlenecks in the frontend are not
diagnosed. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources.  The processor dispatch width defaults to the value
of the ``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors.  Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``.  A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.
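
For example, one could constrain register renaming to a hypothetical budget of
64 physical registers and inspect the resulting dispatch stalls:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -register-file-size=64 -dispatch-stats -register-file-stats foo.s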

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model.  The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order.  The
number of entries in the reorder buffer defaults to the value specified by field
`MicroOpBufferSize` in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction.  Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions.  An instruction
has to wait in the scheduler's buffer until input register operands become
available.  Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order) for execution.
Instruction latencies are computed by :program:`llvm-mca` with the help of the
scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers.  The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager.  The resource manager is responsible for selecting resource
units that are consumed by instructions.  For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet.  There,
instructions wait until they reach the write-back stage.  At that point, they
get removed from the queue and the retire control unit is notified.

When instructions are executed, the retire control unit flags the instruction as
"ready to retire."

Instructions are retired in program order.  The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate out-of-order execution of memory operations, :program:`llvm-mca`
uses a load/store unit (LSUnit) that models the speculative execution of loads
and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias with
store operations (``-noalias=true``).  Under this assumption, younger loads are
always allowed to pass older stores.  Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.
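
To model conservative aliasing instead, the assumption can be disabled by
passing ``false`` to the boolean flag:

.. code-block:: bash

  $ llvm-mca -mcpu=btver2 -noalias=false foo.s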

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations.  That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option).  Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through; etc.) and consequently
to weaken, or strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache.  It only knows if an instruction "MayLoad" and/or "MayStore."  For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not (on its own) know about serializing operations or
memory-barrier like instructions.  The LSUnit used to conservatively use an
instruction's "MayLoad", "MayStore", and unmodeled side effects flags to
determine whether an instruction should be treated as a memory-barrier. This was
inaccurate in general and was changed so that now each instruction has an
IsAStoreBarrier and IsALoadBarrier flag. These flags are mca specific and
default to false for every instruction. If any instruction should have either of
these flags set, it should be done within the target's InstrPostProcess class.
For an example, look at the `X86InstrPostProcess::postProcessInstruction` method
within `llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp`.

A load/store barrier consumes one entry of the load/store queue.  A load/store
barrier enforces ordering of loads/stores.  A younger load cannot pass a load
barrier.  Also, a younger store cannot pass a store barrier.  A younger load
has to wait for the memory/load barrier to execute.  A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules are:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.

In-order Issue and Execute
""""""""""""""""""""""""""""""""""""
In-order processors are modelled as a single ``InOrderIssueStage`` stage. It
bypasses the Dispatch, Scheduler and Load/Store unit. Instructions are issued as
soon as their operand registers are available and resource requirements are
met. Multiple instructions can be issued in one cycle according to the value of
the ``IssueWidth`` parameter in LLVM's scheduling model.

Once issued, an instruction is moved to the ``IssuedInst`` set until it is ready
to retire. :program:`llvm-mca` ensures that writes are committed in-order.
However, an instruction is allowed to commit writes and retire out-of-order if
the ``RetireOOO`` property is true for at least one of its writes.

Custom Behaviour
""""""""""""""""""""""""""""""""""""
Due to certain instructions not being expressed perfectly within their
scheduling model, :program:`llvm-mca` isn't always able to simulate them
perfectly. Modifying the scheduling model isn't always a viable
option though (maybe because the instruction is modeled incorrectly on
purpose or the instruction's behaviour is quite complex). The
CustomBehaviour class can be used in these cases to enforce proper
instruction modeling (often by customizing data dependencies and detecting
hazards that :program:`llvm-mca` has no way of knowing about).

:program:`llvm-mca` comes with one generic and multiple target specific
CustomBehaviour classes. The generic class will be used if the ``-disable-cb``
flag is used or if a target specific CustomBehaviour class doesn't exist for
that target. (The generic class does nothing.) Currently, the CustomBehaviour
class is only a part of the in-order pipeline, but there are plans to add it
to the out-of-order pipeline in the future.

CustomBehaviour's main method is `checkCustomHazard()`, which uses the
current instruction and a list of all instructions still executing within
the pipeline to determine if the current instruction should be dispatched.
As output, the method returns an integer representing the number of cycles
that the current instruction must stall for (this can be an underestimate
if the exact number is not known; a value of 0 represents no stall).

If you'd like to add a CustomBehaviour class for a target that doesn't
already have one, refer to an existing implementation to see how to set it
up. The classes are implemented within the target specific backend (for
example `/llvm/lib/Target/AMDGPU/MCA/`) so that they can access backend symbols.

Custom Views
""""""""""""""""""""""""""""""""""""
:program:`llvm-mca` comes with several Views such as the Timeline View and
Summary View. These Views are generic and can work with most (if not all)
targets. If you wish to add a new View to :program:`llvm-mca` and it does not
require any backend functionality that is not already exposed through MC layer
classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
`/tools/llvm-mca/View/` directory. However, if your new View is target specific
AND requires unexposed backend symbols or functionality, you can define it in
the `/lib/Target/<TargetName>/MCA/` directory.

To enable this target specific View, you will have to use this target's
CustomBehaviour class to override the `CustomBehaviour::getViews()` methods.
There are 3 variations of these methods based on where you want your View to
appear in the output: `getStartViews()`, `getPostInstrInfoViews()`, and
`getEndViews()`. These methods return a vector of Views, so you will want to
return a vector containing all of the target specific Views for the target in
question.

Because these target specific (and backend dependent) Views require the
`CustomBehaviour::getViews()` variants, these Views will not be enabled if
the `-disable-cb` flag is used.

Enabling these custom Views does not affect the non-custom (generic) Views.
Continue to use the usual command line arguments to enable / disable those
Views.