llvm-mca - LLVM Machine Code Analyzer
=====================================

.. program:: llvm-mca

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with a backend for which
there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help diagnose potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

(:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
directive at the beginning of the input. By default its output syntax matches
that of its input.)

Scheduling models are not just used to compute instruction latencies and
throughput, but also to understand what processor resources are available
and how to simulate them.

By design, the quality of the analysis conducted by :program:`llvm-mca` is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a processor,
please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_
against the appropriate backend.

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input. If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.


.. option:: -help

  Print a summary of command line options.

.. option:: -o <filename>

  Use ``<filename>`` as the output filename. See the summary above for more
  details.

.. option:: -mtriple=<target triple>

  Specify a target triple string.

.. option:: -march=<arch>

  Specify the architecture for which to analyze the code. It defaults to the
  host default target.

.. option:: -mcpu=<cpuname>

  Specify the processor for which to analyze the code. By default, the cpu name
  is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

  Specify the output assembly variant for the report generated by the tool.
  On x86, possible values are [0, 1]. A value of 0 (resp. 1) selects the AT&T
  (resp. Intel) assembly format for the code printed out by the tool in the
  analysis report.

.. option:: -print-imm-hex

  Prefer hex format for numeric literals in the output assembly printed as part
  of the report.

.. option:: -dispatch=<width>

  Specify a different dispatch width for the processor. The dispatch width
  defaults to field 'IssueWidth' in the processor scheduling model. If width is
  zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

  Specify the size of the register file.
  When specified, this flag limits how
  many physical registers are available for register renaming purposes. A value
  of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

  Specify the number of iterations to run. If this flag is set to 0, then the
  tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

  If set, the tool assumes that loads and stores don't alias. This is the
  default behavior.

.. option:: -lqueue=<load queue size>

  Specify the size of the load queue in the load/store unit emulated by the
  tool. By default, the tool assumes an unbounded number of entries in the load
  queue. A value of zero for this flag is ignored, and the default load queue
  size is used instead.

.. option:: -squeue=<store queue size>

  Specify the size of the store queue in the load/store unit emulated by the
  tool. By default, the tool assumes an unbounded number of entries in the
  store queue. A value of zero for this flag is ignored, and the default store
  queue size is used instead.

.. option:: -timeline

  Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

  Limit the number of iterations to print in the timeline view. By default, the
  timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

  Limit the number of cycles in the timeline view, or use 0 for no limit. By
  default, the number of cycles is set to 80.

.. option:: -resource-pressure

  Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

  Enable register file usage statistics.

.. option:: -dispatch-stats

  Enable extra dispatch statistics.
  This view collects and analyzes instruction
  dispatch events, as well as static/dynamic dispatch stall events. This view
  is disabled by default.

.. option:: -scheduler-stats

  Enable extra scheduler statistics. This view collects and analyzes instruction
  issue events. This view is disabled by default.

.. option:: -retire-stats

  Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

  Enable the instruction info view. This is enabled by default.

.. option:: -show-encoding

  Enable the printing of instruction encodings within the instruction info view.

.. option:: -show-barriers

  Enable the printing of LoadBarrier and StoreBarrier flags within the
  instruction info view.

.. option:: -all-stats

  Print all hardware statistics. This enables extra statistics related to the
  dispatch logic, the hardware schedulers, the register file(s), and the retire
  control unit. This option is disabled by default.

.. option:: -all-views

  Enable all the views.

.. option:: -instruction-tables

  Prints resource pressure information based on the static information
  available from the processor model. This differs from the resource pressure
  view because it doesn't require the code to be simulated. Instead, it prints
  the theoretical uniform distribution of resource pressure for every
  instruction in the sequence.

.. option:: -bottleneck-analysis

  Print information about bottlenecks that affect the throughput. This analysis
  can be expensive, and it is disabled by default. Bottlenecks are highlighted
  in the summary view. Bottleneck analysis is currently not supported for
  processors with an in-order backend.

.. option:: -json

  Print the requested views in valid JSON format.
  The instructions and the
  processor resources are printed as members of special top level JSON objects.
  The individual views refer to them by index. However, not all views are
  currently supported. For example, the report from the bottleneck analysis is
  not printed out in JSON. All the default views are currently supported.

.. option:: -disable-cb

  Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
  than using the target specific implementation. The generic classes never
  detect any custom hazards or make any post processing modifications to
  instructions.


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------
:program:`llvm-mca` allows the optional use of special code comments to
mark regions of the assembly code to be analyzed. A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region. For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then :program:`llvm-mca` assumes a
default region which contains every instruction in the input file. Every region
is analyzed in isolation, and the final performance report is the union of all
the reports generated for every code region.

Code regions can have names. For example:

.. code-block:: none

  # LLVM-MCA-BEGIN A simple example
  add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple example" with a
single instruction in it. Note how the region name doesn't have to be repeated
in the ``LLVM-MCA-END`` directive.
In the absence of overlapping regions,
an anonymous ``LLVM-MCA-END`` directive always ends the currently active user
defined region.

Example of nesting regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
  add %eax, %edx
  # LLVM-MCA-BEGIN bar
  sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

.. code-block:: none

  # LLVM-MCA-BEGIN foo
  add %eax, %edx
  # LLVM-MCA-BEGIN bar
  sub %eax, %edx
  # LLVM-MCA-END foo
  add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping regions
cannot have the same name.

There is no support for marking regions from high-level source code, like C or
C++. As a workaround, inline assembly directives may be used:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
    a += 42;
    __asm volatile("# LLVM-MCA-END":::"memory");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization and may have
an impact on the code generated. This is because the ``__asm`` statements are
seen as real code having important side effects, which limits how the code
around them can be transformed. If users want to make use of inline assembly
to emit markers, then the recommendation is to always verify that the output
assembly is equivalent to the assembly generated in the absence of markers.
The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
can also help in detecting missed optimizations.

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers.
The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. The following report can be produced by
running the command below on the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle. For processors with an
in-order backend, *DispatchWidth* is the maximum number of micro opcodes issued
to the backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop carried dependencies. Block throughput is limited from above by the
dispatch rate and by the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the `Block RThroughput`.

Field 'uOps Per Cycle' is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue.
In the absence of loop-carried
data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
maximum throughput which can be computed by dividing the number of uOps of a
single iteration by the `Block RThroughput`.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.

The second section of the report is the `instruction info view`. It shows the
latency and reciprocal throughput of every instruction in the sequence. It also
reports extra information related to the number of micro opcodes, and opcode
properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput.
Throughput 455is computed as the maximum number of instructions of a same type that can be 456executed per clock cycle in the absence of operand dependencies. In this 457example, the reciprocal throughput of a vector float multiply is 1 458cycles/instruction. That is because the FP multiplier JFPM is only available 459from pipeline JFPU1. 460 461Instruction encodings are displayed within the instruction info view when flag 462`-show-encoding` is specified. 463 464Below is an example of `-show-encoding` output for the dot-product kernel: 465 466.. code-block:: none 467 468 Instruction Info: 469 [1]: #uOps 470 [2]: Latency 471 [3]: RThroughput 472 [4]: MayLoad 473 [5]: MayStore 474 [6]: HasSideEffects (U) 475 [7]: Encoding Size 476 477 [1] [2] [3] [4] [5] [6] [7] Encodings: Instructions: 478 1 2 1.00 4 c5 f0 59 d0 vmulps %xmm0, %xmm1, %xmm2 479 1 4 1.00 4 c5 eb 7c da vhaddps %xmm2, %xmm2, %xmm3 480 1 4 1.00 4 c5 e3 7c e3 vhaddps %xmm3, %xmm3, %xmm4 481 482The `Encoding Size` column shows the size in bytes of instructions. The 483`Encodings` column shows the actual instruction encodings (byte sequences in 484hex). 485 486The third section is the *Resource pressure view*. This view reports 487the average number of resource cycles consumed every iteration by instructions 488for every processor resource unit available on the target. Information is 489structured in two tables. The first table reports the number of resource cycles 490spent on average every iteration. The second table correlates the resource 491cycles to the machine instruction in the sequence. For example, every iteration 492of the instruction vmulps always executes on resource unit [6] 493(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle 494per iteration. Note that on AMD Jaguar, vector floating-point multiply can 495only be issued to pipeline JFPU1, while horizontal floating-point additions can 496only be issued to pipeline JFPU0. 

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4       <total>

The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations: ``-iterations=3``, the iteration
indices range from 0-2 inclusively.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.
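
These facts can be read mechanically off the timeline string for [1,0]. Below
is a small illustrative Python helper (not part of :program:`llvm-mca`; the
function name is made up for this sketch) that maps the state characters back
to cycle numbers:

```python
# Decode one row of the timeline view. The character at column i (counting
# from the first cycle column) describes the state at cycle i:
# 'D' = dispatch, first 'e' = execution start, 'E' = write-back,
# 'R' = retirement; '.' marks idle cycles.
def decode_timeline_row(row: str) -> dict:
    return {
        "dispatched": row.index("D"),
        "started_executing": row.index("e"),
        "written_back": row.index("E"),
        "retired": row.index("R"),
    }

# Row for instruction [1,0] from the example above.
events = decode_timeline_row(".DeeE-----R")
print(events)
# {'dispatched': 1, 'started_executing': 2, 'written_back': 4, 'retired': 10}
```

The decoded cycle numbers match the four bullet points above.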

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming physical registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. The last row, ``<total>``, shows a global average over
all instructions measured. Note that :program:`llvm-mca`, by default, assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue.
The difference between the two counters is a good indicator
of how large of an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

.. code-block:: none

  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                         Dependency Information
        +----< 2.    vhaddps  %xmm3, %xmm3, %xmm4
        |
        |    < loop carried >
        |
        |      0.    vmulps   %xmm0, %xmm1, %xmm2
        +----> 1.    vhaddps  %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]
        +----> 2.    vhaddps  %xmm3, %xmm3, %xmm4   ## REGISTER dependency:  %xmm3
        |
        |    < loop carried >
        |
        +----> 1.    vhaddps  %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]

According to the analysis, throughput is limited by resource pressure and not by
data dependencies. The analysis observed increases in backend pressure during
48.07% of the simulated run.
Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The `critical sequence` is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.

Bottleneck analysis is currently not supported for processors with an in-order
backend.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a full group because the scheduler's queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by either using the command option
``-all-stats`` or ``-dispatch-stats``.

The next table, *Schedulers*, presents a histogram displaying a count,
representing the number of micro opcodes issued on some number of cycles. In
this case, of the 610 simulated cycles, single opcodes were issued 306 times
(50.2%) and there were 7 cycles where no opcodes were issued.

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries).
Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies. The scheduler
statistics are displayed by using the command option ``-all-stats`` or
``-scheduler-stats``.

The next table, *Retire Control Unit*, presents a histogram displaying a count,
representing the number of instructions retired on some number of cycles. In
this case, of the 610 simulated cycles, two instructions were retired during the
same cycle 399 times (65.4%) and there were 109 cycles where no instructions
were retired. The retire statistics are displayed by using the command option
``-all-stats`` or ``-retire-stats``.

The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF). The table shows that of the 900
instructions processed, there were 900 mappings created. Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings.
However, we see that the pipeline only used a maximum of 35 of 72 available
register slots at any given time. We can conclude that the floating point PRF
was the only register file used for the example, and that it was never resource
constrained. The register file statistics are displayed by using the command
option ``-all-stats`` or ``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions:

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).

:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
decode stages are not modeled. Performance bottlenecks in the frontend are not
diagnosed. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources.
The processor dispatch width defaults to the value of the ``IssueWidth`` field
in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch
  width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model. The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order. The
number of entries in the reorder buffer defaults to the value specified by the
``MicroOpBufferSize`` field in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available.
Only at that point does the instruction become eligible for execution and may
be issued (potentially out-of-order) to the underlying pipelines. Instruction
latencies are computed by :program:`llvm-mca` with the help of the scheduling
model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to
a resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on the availability of their operands, instructions that are
dispatched to the scheduler are either placed into the WaitSet or into the
ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.
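
The WaitSet/ReadySet/IssuedSet bookkeeping described above can be sketched in a
few lines of Python. This is a conceptual illustration only, not
:program:`llvm-mca`'s implementation: the instruction names, latencies,
dependencies, and the one-issue-per-cycle simplification are all invented for
the example.

```python
# Conceptual sketch of the three-set scheduler bookkeeping:
# WaitSet -> ReadySet (operands ready) -> IssuedSet (executing) -> done.

class Inst:
    def __init__(self, name, latency, deps=()):
        self.name = name          # label, used only for reporting
        self.latency = latency    # cycles from issue to write-back
        self.deps = set(deps)     # names of producing instructions
        self.issue_cycle = None

def simulate(insts):
    wait, ready, issued, done = list(insts), [], [], set()
    cycle, order = 0, []          # order: (issue cycle, name) log
    while len(done) < len(insts):
        # Write-back: instructions whose latency has elapsed complete.
        for i in list(issued):
            if cycle >= i.issue_cycle + i.latency:
                issued.remove(i)
                done.add(i.name)
        # WaitSet -> ReadySet once all input operands are available.
        for i in list(wait):
            if i.deps <= done:
                wait.remove(i)
                ready.append(i)
        # Issue the oldest ready instruction (older-first priority);
        # for simplicity this sketch issues at most one per cycle.
        if ready:
            i = ready.pop(0)
            i.issue_cycle = cycle
            issued.append(i)
            order.append((cycle, i.name))
        cycle += 1
    return order

# A hypothetical three-instruction dependency chain, loosely modeled
# on the dot-product example (one multiply feeding two adds).
order = simulate([
    Inst("mul", 2),
    Inst("add1", 1, deps={"mul"}),
    Inst("add2", 1, deps={"add1"}),
])
```

With this made-up chain, each instruction issues only after its producer has
written back, so the issue log spaces the instructions out by their latencies;
that is the data-dependency-limited behavior the report above diagnoses.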

When instructions are executed, the retire control unit flags the instruction
as "ready to retire."

Instructions are retired in program order. The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
utilizes a simulated load/store unit (LSUnit) to model the speculative
execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias with
store operations (``-noalias=true``). Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment there is no way to further relax the memory model (``-noalias`` is the
only option).
Essentially, there is no option to specify a different memory type (e.g.,
write-back, write-combining, write-through) and consequently to weaken, or
strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For loads,
the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not (on its own) know about serializing operations or
memory-barrier-like instructions. The LSUnit used to conservatively use an
instruction's "MayLoad", "MayStore", and unmodeled side effects flags to
determine whether an instruction should be treated as a memory barrier. This
was inaccurate in general, and was changed so that now each instruction has an
IsAStoreBarrier and an IsALoadBarrier flag. These flags are mca-specific and
default to false for every instruction. If any instruction should have either
of these flags set, it should be done within the target's `InstrPostProcess`
class. For an example, look at the
`X86InstrPostProcess::postProcessInstruction` method within
`llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp`.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load has
to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s).
That 973also means, by construction, all of the older loads/stores have been executed. 974 975In conclusion, the full set of load/store consistency rules are: 976 977#. A store may not pass a previous store. 978#. A store may not pass a previous load (regardless of ``-noalias``). 979#. A store has to wait until an older store barrier is fully executed. 980#. A load may pass a previous load. 981#. A load may not pass a previous store unless ``-noalias`` is set. 982#. A load has to wait until an older load barrier is fully executed. 983 984In-order Issue and Execute 985"""""""""""""""""""""""""""""""""""" 986In-order processors are modelled as a single ``InOrderIssueStage`` stage. It 987bypasses Dispatch, Scheduler and Load/Store unit. Instructions are issued as 988soon as their operand registers are available and resource requirements are 989met. Multiple instructions can be issued in one cycle according to the value of 990the ``IssueWidth`` parameter in LLVM's scheduling model. 991 992Once issued, an instruction is moved to ``IssuedInst`` set until it is ready to 993retire. :program:`llvm-mca` ensures that writes are committed in-order. However, 994an instruction is allowed to commit writes and retire out-of-order if 995``RetireOOO`` property is true for at least one of its writes. 996 997Custom Behaviour 998"""""""""""""""""""""""""""""""""""" 999Due to certain instructions not being expressed perfectly within their 1000scheduling model, :program:`llvm-mca` isn't always able to simulate them 1001perfectly. Modifying the scheduling model isn't always a viable 1002option though (maybe because the instruction is modeled incorrectly on 1003purpose or the instruction's behaviour is quite complex). The 1004CustomBehaviour class can be used in these cases to enforce proper 1005instruction modeling (often by customizing data dependencies and detecting 1006hazards that :program:`llvm-mca` has no way of knowing about). 

:program:`llvm-mca` comes with one generic and multiple target-specific
CustomBehaviour classes. The generic class will be used if the ``-disable-cb``
flag is used, or if a target-specific CustomBehaviour class doesn't exist for
that target. (The generic class does nothing.) Currently, the CustomBehaviour
class is only a part of the in-order pipeline, but there are plans to add it to
the out-of-order pipeline in the future.

CustomBehaviour's main method is `checkCustomHazard()`, which uses the current
instruction and a list of all instructions still executing within the pipeline
to determine if the current instruction should be dispatched. As output, the
method returns an integer representing the number of cycles that the current
instruction must stall for (this can be an underestimate if the exact number
isn't known; a value of 0 represents no stall).

If you'd like to add a CustomBehaviour class for a target that doesn't already
have one, refer to an existing implementation to see how to set it up. The
classes are implemented within the target-specific backend (for example,
`/llvm/lib/Target/AMDGPU/MCA/`) so that they can access backend symbols.

Custom Views
""""""""""""
:program:`llvm-mca` comes with several Views, such as the Timeline View and
Summary View. These Views are generic and can work with most (if not all)
targets. If you wish to add a new View to :program:`llvm-mca` and it does not
require any backend functionality that is not already exposed through MC layer
classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
`/tools/llvm-mca/Views/` directory. However, if your new View is
target-specific AND requires unexposed backend symbols or functionality, you
can define it in the `/lib/Target/<TargetName>/MCA/` directory.

To enable this target-specific View, you will have to use this target's
CustomBehaviour class to override the `CustomBehaviour::getViews()` methods.
There are 3 variations of these methods, based on where you want your View to
appear in the output: `getStartViews()`, `getPostInstrInfoViews()`, and
`getEndViews()`. These methods return a vector of Views, so you will want to
return a vector containing all of the target-specific Views for the target in
question.

Because these target-specific (and backend dependent) Views require the
`CustomBehaviour::getViews()` variants, these Views will not be enabled if the
`-disable-cb` flag is used.

Enabling these custom Views does not affect the non-custom (generic) Views.
Continue to use the usual command line arguments to enable / disable those
Views.
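
As an illustration of the mechanism, here is a hypothetical Python sketch of
how the three `getViews()` variants and ``-disable-cb`` interact. The real
interface is C++; the class and method names below are invented stand-ins, and
where each group of Views is spliced into the report is determined by
:program:`llvm-mca`, not by this sketch.

```python
# Collect the target-specific Views a CustomBehaviour contributes at
# the three positions described above. With -disable-cb (or no
# target-specific CustomBehaviour), nothing is contributed and only
# the generic Views remain.

def collect_views(custom_behaviour, disable_cb=False):
    if disable_cb or custom_behaviour is None:
        return {"start": [], "post_instr_info": [], "end": []}
    return {
        "start": custom_behaviour.get_start_views(),
        "post_instr_info": custom_behaviour.get_post_instr_info_views(),
        "end": custom_behaviour.get_end_views(),
    }

class FakeCB:
    """Invented stand-in for a target's CustomBehaviour class."""
    def get_start_views(self):
        return ["TargetStartView"]
    def get_post_instr_info_views(self):
        return []
    def get_end_views(self):
        return ["TargetEndView"]

views = collect_views(FakeCB())
```

Each variant simply returns a list (a vector, in the C++ interface), so a
target that wants several Views at one position returns them all from the
corresponding method.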