llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code in a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with a backend for which
there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

(:program:`llvm-mca` detects Intel syntax by the presence of an ``.intel_syntax``
directive at the beginning of the input.  By default its output syntax matches
that of its input.)

Scheduling models are not only used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a processor,
please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input.  If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.


.. option:: -help

 Print a summary of command line options.

.. option:: -o <filename>

 Use ``<filename>`` as the output filename. See the summary above for more
 details.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

  Specify the processor for which to analyze the code.  By default, the cpu name
  is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format, while a value of 1 selects the Intel assembly format for the code
 printed out by the tool in the analysis report.

.. option:: -print-imm-hex

 Prefer hex format for numeric literals in the output assembly printed as part
 of the report.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to field 'IssueWidth' in the processor scheduling model.  If width is
 zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

  If set, the tool assumes that loads and stores don't alias. This is the
  default behavior.

.. option:: -lqueue=<load queue size>

  Specify the size of the load queue in the load/store unit emulated by the tool.
  By default, the tool assumes an unbounded number of entries in the load queue.
  A value of zero for this flag is ignored, and the default load queue size is
  used instead.

.. option:: -squeue=<store queue size>

  Specify the size of the store queue in the load/store unit emulated by the
  tool. By default, the tool assumes an unbounded number of entries in the store
  queue. A value of zero for this flag is ignored, and the default store queue
  size is used instead.

.. option:: -timeline

  Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

  Limit the number of iterations to print in the timeline view. By default, the
  timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

  Limit the number of cycles in the timeline view, or use 0 for no limit. By
  default, the number of cycles is set to 80.

.. option:: -resource-pressure

  Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

  Enable register file usage statistics.

.. option:: -dispatch-stats

  Enable extra dispatch statistics. This view collects and analyzes instruction
  dispatch events, as well as static/dynamic dispatch stall events. This view
  is disabled by default.

.. option:: -scheduler-stats

  Enable extra scheduler statistics. This view collects and analyzes instruction
  issue events. This view is disabled by default.

.. option:: -retire-stats

  Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

  Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

  Enable the printing of instruction encodings within the instruction info view.

.. option:: -all-stats

  Print all hardware statistics. This enables extra statistics related to the
  dispatch logic, the hardware schedulers, the register file(s), and the retire
  control unit. This option is disabled by default.

.. option:: -all-views

  Enable all the views.

.. option:: -instruction-tables

  Prints resource pressure information based on the static information
  available from the processor model. This differs from the resource pressure
  view because it doesn't require that the code is simulated. It instead prints
  the theoretical uniform distribution of resource pressure for every
  instruction in sequence.

.. option:: -bottleneck-analysis

  Print information about bottlenecks that affect the throughput. This analysis
  can be expensive, and it is disabled by default. Bottlenecks are highlighted
  in the summary view. Bottleneck analysis is currently not supported for
  processors with an in-order backend.

.. option:: -json

  Print the requested views in valid JSON format. The instructions and the
  processor resources are printed as members of special top level JSON objects.
  The individual views refer to them by index. However, not all views are
  currently supported. For example, the report from the bottleneck analysis is
  not printed out in JSON. All the default views are currently supported.

.. option:: -disable-cb

  Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
  than using the target specific implementation. The generic classes never
  detect any custom hazards or make any post processing modifications to
  instructions.


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------
:program:`llvm-mca` allows for the optional usage of special code comments to
mark regions of the assembly code to be analyzed.  A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region.  For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file.  Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every code region.

Code regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
    add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active user
defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
    add %eax, %edx
  # LLVM-MCA-BEGIN bar
    sub %eax, %edx
  # LLVM-MCA-END foo
    add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2.  The following result can be produced via
the following command using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections.  The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *Dispatch Width* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle. For processors with an
in-order backend, *Dispatch Width* is the maximum number of micro opcodes issued
to the backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop carried dependencies. Block throughput is limited from above by the
dispatch rate and by the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the *Block RThroughput*.

Field *uOps Per Cycle* is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed *uOps Per Cycle* should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the *Block RThroughput*.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and *uOps Per Cycle* are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
*Block RThroughput*) is an indicator of a performance bottleneck caused by the
lack of hardware resources. In general, the lower the Block RThroughput, the
better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00) and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.
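
The arithmetic behind these summary fields can be sketched in a few lines of
Python. This is a hand-written illustration of the formulas above (not part of
llvm-mca), using the numbers from the btver2 report:

.. code-block:: python

  # Summary numbers copied from the report above (btver2, 300 iterations).
  instructions = 900       # total simulated instructions
  uops = 900               # total simulated micro opcodes
  cycles = 610             # total simulated cycles
  uops_per_iteration = 3   # one vmulps plus two vhaddps
  block_rthroughput = 2.0  # "Block RThroughput" field

  ipc = instructions / cycles              # 1.475..., printed as 1.48
  uops_per_cycle = uops / cycles           # same value here: 1 uOp per instruction
  max_uop_throughput = uops_per_iteration / block_rthroughput  # 1.50

  print(f"IPC:             {ipc:.2f}")
  print(f"uOps Per Cycle:  {uops_per_cycle:.2f}")
  print(f"Theoretical max: {max_uop_throughput:.2f}")

The gap between the observed 1.48 and the theoretical 1.50 shrinks as the
number of simulated iterations grows.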

The second section of the report is the *instruction info view*. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of the same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1
cycle/instruction.  That is because the FP multiplier JFPM is only available
from pipeline JFPU1.

Instruction encodings are displayed within the instruction info view when flag
``-show-encoding`` is specified.

Below is an example of ``-show-encoding`` output for the dot-product kernel:

.. code-block:: none

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
   1      2     1.00                         4     c5 f0 59 d0                   vmulps   %xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da                   vhaddps  %xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3                   vhaddps  %xmm3, %xmm3, %xmm4

The *Encoding Size* column shows the size in bytes of instructions.  The
*Encodings* column shows the actual instruction encodings (byte sequences in
hex).

The third section is the *Resource pressure view*.  This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target.  Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instructions in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration.  Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources.  Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided.  Ideally,
pressure should be uniformly distributed between multiple resources.
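
The relationship between the two pressure tables can be checked by hand:
summing each resource column of the per-instruction table reproduces the
per-iteration row. A small illustrative Python sketch (values copied from the
report above; not part of llvm-mca):

.. code-block:: python

  # Resource cycles per instruction for the four resources actually used;
  # columns are JFPA, JFPM, JFPU0, JFPU1 (report indices [3], [4], [5], [6]).
  pressure_by_instruction = {
      "vmulps":    [0.0, 1.0, 0.0, 1.0],
      "vhaddps#1": [1.0, 0.0, 1.0, 0.0],
      "vhaddps#2": [1.0, 0.0, 1.0, 0.0],
  }

  # Column sums reproduce the "Resource pressure per iteration" row.
  per_iteration = [sum(col) for col in zip(*pressure_by_instruction.values())]
  print(per_iteration)  # [2.0, 1.0, 2.0, 1.0]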

Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline.  This view is enabled by the
command line option ``-timeline``.  As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4       <total>

The timeline view is interesting because it shows instruction state changes
during execution.  It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables.  The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence).  Since this
example was generated using 3 iterations (``-iterations=3``), the iteration
indices range from 0 to 2, inclusive.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.
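
These event cycles can be read straight off the state characters in the
timeline row for [1,0]. A small hand-written Python sketch (not part of
llvm-mca) that decodes such a row:

.. code-block:: python

  # Timeline row for instruction [1,0], copied from the report above.
  # Each character position corresponds to one simulated cycle.
  row = ".DeeE-----R"

  events = {
      "dispatched": row.index("D"),  # cycle 1
      "first_exec": row.index("e"),  # cycle 2
      "write_back": row.index("E"),  # cycle 4
      "retired":    row.index("R"),  # cycle 10
  }
  print(events)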

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction.  So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency
chain.  Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations.  However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. The last row, ``<total>``, shows a global average over
all instructions measured. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue.  The difference between the two counters is a good indicator
of how big an impact data dependencies had on the execution of the
instructions.  When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small.  However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

.. code-block:: none


  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]


According to the analysis, throughput is limited by resource pressure and not by
data dependencies.  The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The *critical sequence* is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.

Bottleneck analysis is currently not supported for processors with an in-order
backend.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see that the counter
for SCHEDQ reports 272 cycles.  This counter is incremented every time the
dispatch logic is unable to dispatch a full group because the scheduler's queue
is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
dispatch statistics are displayed by using either the command option
``-all-stats`` or ``-dispatch-stats``.
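
The histogram percentages can be cross-checked against the 610 total simulated
cycles reported earlier. A short illustrative Python sketch (values copied from
the report above; not part of llvm-mca):

.. code-block:: python

  # Dispatch histogram from the report above; 610 total simulated cycles.
  total_cycles = 610
  dispatch_histogram = {0: 24, 1: 272, 2: 314}

  # Every simulated cycle falls into exactly one bucket.
  assert sum(dispatch_histogram.values()) == total_cycles

  for n_dispatched, n_cycles in dispatch_histogram.items():
      pct = 100.0 * n_cycles / total_cycles
      print(f"{n_dispatched} uOps dispatched on {n_cycles} cycles ({pct:.1f}%)")

Note that the 272 cycles in the one-uOp bucket match the SCHEDQ stall counter:
on each of those cycles, a full scheduler queue prevented a second dispatch.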
751
The next table, *Schedulers*, presents a histogram displaying a count,
representing the number of micro opcodes issued on some number of cycles. In
this case, of the 610 simulated cycles, a single micro opcode was issued during
306 cycles (50.2%), and there were 7 cycles where no micro opcodes were issued.
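
As a quick sanity check, this histogram can be reduced to an average issue rate.
The following is a small illustrative calculation, not something
:program:`llvm-mca` prints in this form:

.. code-block:: python

  # Cycle counts from the Schedulers histogram above:
  # N micro opcodes issued -> number of cycles.
  histogram = {0: 7, 1: 306, 2: 297}

  total_cycles = sum(histogram.values())                 # 610 simulated cycles
  total_uops = sum(n * c for n, c in histogram.items())  # 900 micro opcodes
  print(total_uops / total_cycles)                       # ~1.48 uops per cycle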
756
The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
760three schedulers:
761
762* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
764* JLSAGU - A scheduler for address generation.
765
766The dot-product is a kernel of three floating point instructions (a vector
767multiply followed by two horizontal adds).  That explains why only the floating
768point scheduler appears to be used.
769
770A full scheduler queue is either caused by data dependency chains or by a
771sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
772mitigated by rewriting the kernel using different instructions that consume
773different scheduler resources.  Schedulers with a small queue are less resilient
774to bottlenecks caused by the presence of long data dependencies.  The scheduler
775statistics are displayed by using the command option ``-all-stats`` or
776``-scheduler-stats``.
777
778The next table, *Retire Control Unit*, presents a histogram displaying a count,
779representing the number of instructions retired on some number of cycles.  In
780this case, of the 610 simulated cycles, two instructions were retired during the
781same cycle 399 times (65.4%) and there were 109 cycles where no instructions
782were retired.  The retire statistics are displayed by using the command option
783``-all-stats`` or ``-retire-stats``.
784
785The last table presented is *Register File statistics*.  Each physical register
786file (PRF) used by the pipeline is presented in this table.  In the case of AMD
787Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
788and one for integer registers (JIntegerPRF).  The table shows that of the 900
789instructions processed, there were 900 mappings created.  Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
791creating the 900 mappings.  However, we see that the pipeline only used a
792maximum of 35 of 72 available register slots at any given time. We can conclude
793that the floating point PRF was the only register file used for the example, and
794that it was never resource constrained.  The register file statistics are
795displayed by using the command option ``-all-stats`` or
796``-register-file-stats``.
797
798In this example, we can conclude that the IPC is mostly limited by data
799dependencies, and not by resource pressure.
800
801Instruction Flow
802^^^^^^^^^^^^^^^^
803This section describes the instruction flow through the default pipeline of
804:program:`llvm-mca`, as well as the functional units involved in the process.
805
806The default pipeline implements the following sequence of stages used to
807process instructions.
808
809* Dispatch (Instruction is dispatched to the schedulers).
810* Issue (Instruction is issued to the processor pipelines).
811* Write Back (Instruction is executed, and results are written back).
812* Retire (Instruction is retired; writes are architecturally committed).
813
The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).
817
818:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
820decode stages are not modeled. Performance bottlenecks in the frontend are not
821diagnosed. Also, :program:`llvm-mca` does not model branch prediction.
822
823Instruction Dispatch
824""""""""""""""""""""
825During the dispatch stage, instructions are picked in program order from a
826queue of already decoded instructions, and dispatched in groups to the
827simulated hardware schedulers.
828
829The size of a dispatch group depends on the availability of the simulated
830hardware resources.  The processor dispatch width defaults to the value
831of the ``IssueWidth`` in LLVM's scheduling model.
832
833An instruction can be dispatched if:
834
* The size of the dispatch group is smaller than the processor's dispatch width.
836* There are enough entries in the reorder buffer.
837* There are enough physical registers to do register renaming.
838* The schedulers are not full.
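
These conditions can be sketched as a single predicate.  The following is a
hypothetical model for illustration only; the names and signature do not
correspond to llvm-mca's actual implementation:

.. code-block:: python

  # Hypothetical model of the dispatch checks above; the names are
  # illustrative and do not mirror llvm-mca's actual C++ classes.
  def can_dispatch(group_size, dispatch_width, free_rob_entries, uops,
                   free_phys_regs, regs_needed, scheduler_full):
      return (group_size < dispatch_width        # room left in this group
              and free_rob_entries >= uops       # reorder buffer has space
              and free_phys_regs >= regs_needed  # renaming can proceed
              and not scheduler_full)            # scheduler queue has space

  # Dispatch proceeds when every resource is available...
  print(can_dispatch(1, 2, 10, 1, 4, 1, False))  # True
  # ...but stalls when the scheduler queue is full (a SCHEDQ stall).
  print(can_dispatch(1, 2, 10, 1, 4, 1, True))   # False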
839
840Scheduling models can optionally specify which register files are available on
841the processor. :program:`llvm-mca` uses that information to initialize register
842file descriptors.  Users can limit the number of physical registers that are
843globally available for register renaming by using the command option
844``-register-file-size``.  A value of zero for this option means *unbounded*. By
845knowing how many registers are available for renaming, the tool can predict
846dispatch stalls caused by the lack of physical registers.
847
848The number of reorder buffer entries consumed by an instruction depends on the
849number of micro-opcodes specified for that instruction by the target scheduling
850model.  The reorder buffer is responsible for tracking the progress of
851instructions that are "in-flight", and retiring them in program order.  The
852number of entries in the reorder buffer defaults to the value specified by field
853`MicroOpBufferSize` in the target scheduling model.
854
855Instructions that are dispatched to the schedulers consume scheduler buffer
856entries. :program:`llvm-mca` queries the scheduling model to determine the set
857of buffered resources consumed by an instruction.  Buffered resources are
858treated like scheduler resources.
859
860Instruction Issue
861"""""""""""""""""
862Each processor scheduler implements a buffer of instructions.  An instruction
863has to wait in the scheduler's buffer until input register operands become
available.  Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order).
866Instruction latencies are computed by :program:`llvm-mca` with the help of the
867scheduling model.
868
869:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
870schedulers.  The scheduler is responsible for tracking data dependencies, and
871dynamically selecting which processor resources are consumed by instructions.
872It delegates the management of processor resource units and resource groups to a
873resource manager.  The resource manager is responsible for selecting resource
874units that are consumed by instructions.  For example, if an instruction
875consumes 1cy of a resource group, the resource manager selects one of the
876available units from the group; by default, the resource manager uses a
877round-robin selector to guarantee that resource usage is uniformly distributed
878between all units of a group.
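
The round-robin policy can be illustrated with a minimal selector.  This is a
sketch, not llvm-mca's code, and the unit names are hypothetical:

.. code-block:: python

  # Illustrative round-robin selection over the units of a resource group,
  # mirroring the default policy described above.
  class RoundRobinSelector:
      def __init__(self, units):
          self.units = list(units)
          self.next_index = 0

      def select(self):
          # Pick the next unit and advance the cursor, wrapping around so
          # usage is uniformly distributed between all units of the group.
          unit = self.units[self.next_index]
          self.next_index = (self.next_index + 1) % len(self.units)
          return unit

  group = RoundRobinSelector(["FPU0", "FPU1"])
  print([group.select() for _ in range(4)])  # ['FPU0', 'FPU1', 'FPU0', 'FPU1']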
879
880:program:`llvm-mca`'s scheduler internally groups instructions into three sets:
881
882* WaitSet: a set of instructions whose operands are not ready.
883* ReadySet: a set of instructions ready to execute.
884* IssuedSet: a set of instructions executing.
885
Depending on the availability of their operands, instructions that are
dispatched to the scheduler are either placed into the WaitSet or into the
ReadySet.
888
889Every cycle, the scheduler checks if instructions can be moved from the WaitSet
890to the ReadySet, and if instructions from the ReadySet can be issued to the
891underlying pipelines. The algorithm prioritizes older instructions over younger
892instructions.
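
The per-cycle bookkeeping described above can be sketched as follows.  This is
an illustrative model (not llvm-mca's implementation) where an instruction is
reduced to a ``(program_index, ready_cycle)`` pair and older instructions have
a smaller program index:

.. code-block:: python

  # Sketch of the per-cycle WaitSet/ReadySet update described above.
  def cycle_update(cycle, wait_set, ready_set, issue_width):
      # WaitSet -> ReadySet: operands have become available this cycle.
      for instr in [i for i in wait_set if i[1] <= cycle]:
          wait_set.remove(instr)
          ready_set.append(instr)
      # Issue up to issue_width instructions, prioritizing older ones.
      ready_set.sort(key=lambda i: i[0])
      issued = ready_set[:issue_width]
      del ready_set[:issue_width]
      return issued  # these would move to the IssuedSet

  wait, ready = [(2, 1), (3, 5)], [(0, 0), (1, 0)]
  print(cycle_update(1, wait, ready, 2))  # [(0, 0), (1, 0)] issue first
  print(wait, ready)                      # [(3, 5)] [(2, 1)]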
893
894Write-Back and Retire Stage
895"""""""""""""""""""""""""""
896Issued instructions are moved from the ReadySet to the IssuedSet.  There,
897instructions wait until they reach the write-back stage.  At that point, they
898get removed from the queue and the retire control unit is notified.
899
When an instruction is executed, the retire control unit flags it as
"ready to retire."
902
903Instructions are retired in program order.  The register file is notified of the
904retirement so that it can free the physical registers that were allocated for
905the instruction during the register renaming stage.
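
In-order retirement can be modeled with a small sketch.  This is illustrative
only; the instruction ids and physical register names are hypothetical:

.. code-block:: python

  # Instructions may finish execution out of order, but retire (and free
  # the physical registers allocated during renaming) in program order.
  def retire_in_order(rob, executed, free_regs):
      # rob: program-ordered (instr_id, [phys_regs]); executed: set of ids.
      while rob and rob[0][0] in executed:
          instr_id, phys_regs = rob.pop(0)
          free_regs.extend(phys_regs)  # the register file is notified

  rob = [(0, ["p1"]), (1, ["p2"]), (2, ["p3"])]
  freed = []
  retire_in_order(rob, {0, 2}, freed)
  print(freed)  # ['p1'] -- instruction 2 finished but must wait for 1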
906
907Load/Store Unit and Memory Consistency Model
908""""""""""""""""""""""""""""""""""""""""""""
To model the out-of-order execution of memory operations, :program:`llvm-mca`
uses a simulated load/store unit (LSUnit) that tracks the speculative
execution of loads and stores.
912
913Each load (or store) consumes an entry in the load (or store) queue. Users can
914specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
915load and store queues respectively. The queues are unbounded by default.
916
917The LSUnit implements a relaxed consistency model for memory loads and stores.
918The rules are:
919
9201. A younger load is allowed to pass an older load only if there are no
921   intervening stores or barriers between the two loads.
9222. A younger load is allowed to pass an older store provided that the load does
923   not alias with the store.
9243. A younger store is not allowed to pass an older store.
9254. A younger store is not allowed to pass an older load.
926
By default, the LSUnit optimistically assumes that loads do not alias with
store operations (``-noalias=true``).  Under this assumption, younger loads are
929always allowed to pass older stores.  Essentially, the LSUnit does not attempt
930to run any alias analysis to predict when loads and stores do not alias with
931each other.
932
933Note that, in the case of write-combining memory, rule 3 could be relaxed to
934allow reordering of non-aliasing store operations.  That being said, at the
935moment, there is no way to further relax the memory model (``-noalias`` is the
936only option).  Essentially, there is no option to specify a different memory
937type (e.g., write-back, write-combining, write-through; etc.) and consequently
938to weaken, or strengthen, the memory model.
939
940Other limitations are:
941
942* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about the cache hierarchy and memory types.
944* The LSUnit does not know how to identify serializing operations and memory
945  fences.
946
947The LSUnit does not attempt to predict if a load or store hits or misses the L1
948cache.  It only knows if an instruction "MayLoad" and/or "MayStore."  For
949loads, the scheduling model provides an "optimistic" load-to-use latency (which
950usually matches the load-to-use latency for when there is a hit in the L1D).
951
:program:`llvm-mca` does not know about serializing operations or
memory-barrier-like instructions.  The LSUnit conservatively assumes that an
instruction which
954has both "MayLoad" and unmodeled side effects behaves like a "soft"
955load-barrier.  That means, it serializes loads without forcing a flush of the
956load queue.  Similarly, instructions that "MayStore" and have unmodeled side
957effects are treated like store barriers.  A full memory barrier is a "MayLoad"
958and "MayStore" instruction with unmodeled side effects.  This is inaccurate, but
959it is the best that we can do at the moment with the current information
960available in LLVM.
961
A load/store barrier consumes one entry in the load/store queue and enforces
ordering among loads/stores: a younger load cannot pass a load barrier, and a
younger store cannot pass a store barrier.  A younger load has to wait for an
older load barrier to execute.  A load/store barrier is "executed" when it
becomes the oldest entry in the load/store queue(s).  That also means, by
construction, all of the older loads/stores have executed.
968
969In conclusion, the full set of load/store consistency rules are:
970
971#. A store may not pass a previous store.
972#. A store may not pass a previous load (regardless of ``-noalias``).
973#. A store has to wait until an older store barrier is fully executed.
974#. A load may pass a previous load.
975#. A load may not pass a previous store unless ``-noalias`` is set.
976#. A load has to wait until an older load barrier is fully executed.
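
These rules can be distilled into a small predicate.  The sketch below is an
illustrative approximation (intentionally conservative for stores behind
barriers), not llvm-mca's implementation:

.. code-block:: python

  # `op` is the younger memory operation; `older_pending` lists older
  # operations that have not yet executed.  Kinds: 'load', 'store',
  # 'load_barrier', 'store_barrier'.  `noalias` mirrors -noalias.
  def may_execute(op, older_pending, noalias=True):
      for older in older_pending:
          if op in ("store", "store_barrier"):
              return False                 # rules 1-3: stores pass nothing
          if op in ("load", "load_barrier"):
              if older == "load_barrier":
                  return False             # rule 6
              if older == "store" and not noalias:
                  return False             # rule 5
              # rule 4: a load may pass an older load
      return True

  print(may_execute("load", ["load"]))                  # True  (rule 4)
  print(may_execute("load", ["store"]))                 # True  (-noalias)
  print(may_execute("load", ["store"], noalias=False))  # False (rule 5)
  print(may_execute("store", ["load"]))                 # False (rule 2)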
977
978In-order Issue and Execute
979""""""""""""""""""""""""""""""""""""
In-order processors are modeled as a single ``InOrderIssueStage`` stage that
bypasses the Dispatch, Scheduler, and Load/Store units. Instructions are issued as
982soon as their operand registers are available and resource requirements are
983met. Multiple instructions can be issued in one cycle according to the value of
984the ``IssueWidth`` parameter in LLVM's scheduling model.
985
Once issued, an instruction is moved to the ``IssuedInst`` set until it is ready
to retire. :program:`llvm-mca` ensures that writes are committed in-order.
However, an instruction is allowed to commit writes and retire out-of-order if
the ``RetireOOO`` property is true for at least one of its writes.
990
991Custom Behaviour
992""""""""""""""""""""""""""""""""""""
Because some instructions are not expressed perfectly within their scheduling
model, :program:`llvm-mca` isn't always able to simulate them accurately.
Modifying the scheduling model isn't always a viable option though (maybe
because the instruction is modeled incorrectly on purpose, or because the
instruction's behaviour is quite complex). The CustomBehaviour class can be
used in these cases to enforce proper instruction modeling (often by
customizing data dependencies and detecting hazards that :program:`llvm-mca`
has no way of knowing about).
1001
1002:program:`llvm-mca` comes with one generic and multiple target specific
1003CustomBehaviour classes. The generic class will be used if the ``-disable-cb``
1004flag is used or if a target specific CustomBehaviour class doesn't exist for
1005that target. (The generic class does nothing.) Currently, the CustomBehaviour
1006class is only a part of the in-order pipeline, but there are plans to add it
1007to the out-of-order pipeline in the future.
1008
1009CustomBehaviour's main method is `checkCustomHazard()` which uses the
1010current instruction and a list of all instructions still executing within
1011the pipeline to determine if the current instruction should be dispatched.
As output, the method returns an integer representing the number of cycles
that the current instruction must stall for (this can be an underestimate if
you don't know the exact number; a value of 0 represents no stall).
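
The contract can be modeled roughly as follows.  This is a hypothetical Python
analogue of the C++ interface; the opcode names and the two-cycle stall are
invented purely for illustration:

.. code-block:: python

  # Given the instructions still in flight and the current instruction,
  # return how many cycles the current instruction must stall.
  def check_custom_hazard(in_flight, current):
      for opcode in in_flight:
          if opcode == "SETSTATE" and current == "USESTATE":
              return 2   # stall for (at least) two cycles
      return 0           # no custom hazard detected

  print(check_custom_hazard(["SETSTATE"], "USESTATE"))  # 2
  print(check_custom_hazard([], "USESTATE"))            # 0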
1015
1016If you'd like to add a CustomBehaviour class for a target that doesn't
1017already have one, refer to an existing implementation to see how to set it
1018up. The classes are implemented within the target specific backend (for
1019example `/llvm/lib/Target/AMDGPU/MCA/`) so that they can access backend symbols.
1020
1021Custom Views
1022""""""""""""""""""""""""""""""""""""
1023:program:`llvm-mca` comes with several Views such as the Timeline View and
1024Summary View. These Views are generic and can work with most (if not all)
1025targets. If you wish to add a new View to :program:`llvm-mca` and it does not
1026require any backend functionality that is not already exposed through MC layer
1027classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
1028`/tools/llvm-mca/View/` directory. However, if your new View is target specific
1029AND requires unexposed backend symbols or functionality, you can define it in
1030the `/lib/Target/<TargetName>/MCA/` directory.
1031
1032To enable this target specific View, you will have to use this target's
1033CustomBehaviour class to override the `CustomBehaviour::getViews()` methods.
1034There are 3 variations of these methods based on where you want your View to
1035appear in the output: `getStartViews()`, `getPostInstrInfoViews()`, and
`getEndViews()`. These methods return a vector of Views, so you will want to
1037return a vector containing all of the target specific Views for the target in
1038question.
1039
1040Because these target specific (and backend dependent) Views require the
1041`CustomBehaviour::getViews()` variants, these Views will not be enabled if
1042the `-disable-cb` flag is used.
1043
1044Enabling these custom Views does not affect the non-custom (generic) Views.
1045Continue to use the usual command line arguments to enable / disable those
1046Views.
1047