# Optimizing Clang: A Practical Example of Applying BOLT

## Preface

*BOLT* (Binary Optimization and Layout Tool) is designed to improve application
performance by laying out code in a manner that helps the CPU better utilize its caching and
branch-prediction resources.

The most obvious candidates for BOLT optimizations
are programs that suffer from many instruction cache and iTLB misses, such as
large applications measuring hundreds of megabytes in size. However, medium-sized
programs can benefit too. Clang, one of the most popular open-source C/C++ compilers,
is a good example of the latter. Its code size can easily be on the order of tens of megabytes.
As we will see, the Clang binary suffers from many instruction cache
misses and can be significantly improved with BOLT, even on top of profile-guided and
link-time optimizations.

In this tutorial we will first build Clang with PGO and LTO, and then show how to
apply BOLT optimizations to make Clang up to 15% faster. We will also analyze where
the compile-time performance gains come from, and verify that the speed-ups
hold when building other applications.

## Building Clang

The process of getting Clang sources and performing the build is very similar to the
one described at http://clang.llvm.org/get_started.html. For completeness, we provide detailed steps
on how to obtain and build Clang in the [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto) section.

The only difference from the standard Clang build is that we require the `-Wl,-q` flag to be present during
the final link. This option saves relocation metadata in the executable file, but does not affect
the generated code in any way.

## Optimizing Clang with BOLT

We will use the setup described in [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto).
Adjust the steps accordingly if you skipped that section. We will also assume that `llvm-bolt` is present in your `$PATH`.

Before we can run BOLT optimizations, we need to collect a profile for Clang, and we will use
Clang/LLVM sources for that.
Collecting an accurate profile requires running `perf` on hardware that
implements taken branch sampling (`-b/-j` flag). For that reason, it may not be possible to
collect an accurate profile in a virtualized environment, e.g. in the cloud.
We do support regular sampling profiles, but the performance
improvements are expected to be more modest.

```bash
$ mkdir ${TOPLEV}/stage3
$ cd ${TOPLEV}/stage3
$ CPATH=${TOPLEV}/stage2-prof-use-lto/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3/install
$ perf record -e cycles:u -j any,u -- ninja clang
```

Once the last command is finished, it will create a `perf.data` file larger than 10GiB.
We will first convert this profile into a more compact aggregated
form suitable to be consumed by BOLT:
```bash
$ perf2bolt $CPATH/clang-7 -p perf.data -o clang-7.fdata -w clang-7.yaml
```
Notice that we are passing `clang-7` to `perf2bolt`, which is the real binary that
`clang` and `clang++` are symlinked to. The next step will optimize Clang using
the generated profile:
```bash
$ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -b clang-7.yaml \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 \
    -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
```
The output will look similar to the one below:
```t
...
BOLT-INFO: enabling relocation mode
BOLT-INFO: 11415 functions out of 104526 simple functions (10.9%) have non-empty execution profile.
...
BOLT-INFO: ICF folded 29144 out of 105177 functions in 8 passes. 82 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 5466.69 KB of code space. Folded functions were called 2131985 times based on profile.
BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
...
     660155947 : executed forward branches (-2.3%)
      48252553 : taken forward branches (-57.2%)
     129897961 : executed backward branches (+13.8%)
      52389551 : taken backward branches (-19.5%)
      35650038 : executed unconditional branches (-33.2%)
     128338874 : all function calls (=)
      19010563 : indirect calls (=)
       9918250 : PLT calls (=)
    6113398840 : executed instructions (-0.6%)
    1519537463 : executed load instructions (=)
     943321306 : executed store instructions (=)
      20467109 : taken jump table branches (=)
     825703946 : total branches (-2.1%)
     136292142 : taken branches (-41.1%)
     689411804 : non-taken conditional branches (+12.6%)
     100642104 : taken conditional branches (-43.4%)
     790053908 : all conditional branches (=)
...
```
The statistics in the output are based on the LBR profile collected with `perf`, and since we were using
the `cycles` counter, their accuracy is affected. However, the relative improvement in
`taken conditional branches` is a good indication that BOLT was able to straighten out the code even after PGO.
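
When comparing several BOLT runs, it can help to pull a single metric out of the `-dyno-stats` listing. The following is a minimal sketch, not part of BOLT: `dyno_delta` is a hypothetical helper, and it assumes the `count : label (delta)` line layout shown above.

```bash
# dyno_delta METRIC (hypothetical helper): read llvm-bolt output on stdin and
# print the relative change reported for the dyno-stats line labelled METRIC.
dyno_delta() {
    awk -v metric="$1" -F' : ' '
        # Only lines ending in a parenthesised delta such as (-41.1%) or (=) are considered.
        match($2, /\([^)]*\)[ \t]*$/) {
            label = substr($2, 1, RSTART - 1)
            sub(/[ \t]+$/, "", label)
            if (label == metric) {
                delta = substr($2, RSTART)
                gsub(/[() \t]/, "", delta)
                print delta
            }
        }'
}
```

Assuming the `llvm-bolt` output was saved to a file (here a hypothetical `bolt.log`), `dyno_delta "taken branches" < bolt.log` would print `-41.1%` for the run above.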

## Measuring Compile-time Improvement

`clang-7.bolt` can be used as a replacement for *PGO+LTO* Clang:
```bash
$ mv $CPATH/clang-7 $CPATH/clang-7.org
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
```
Doing a new build of Clang using the new binary shows a significant overall
build time reduction on a 48-core Haswell system:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
202.72
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
180.11
```
That's 22.61 seconds (or 12%) faster compared to the *PGO+LTO* build.
Notice that we are measuring an improvement of the total build time, which includes the time spent in the linker.
Compilation time improvements for individual files differ, and speedups over 15% are not uncommon.
If we run BOLT on a Clang binary compiled without *PGO+LTO* (in which case the build finishes in 253.32 seconds),
the gains we see are over 50 seconds (25%),
but, as expected, the result is still slower than the *PGO+LTO+BOLT* build.

## Source of the Wins

We mentioned that Clang suffers from considerable instruction cache misses. This can be measured with `perf`:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
  ...
   16,366,101,626,647      instructions
      359,996,216,537      L1-icache-misses
```
That's about 22 instruction cache misses per thousand instructions. As a rule of thumb, if the application
has over 10 misses per thousand instructions, it is a good indication that it will be improved by BOLT.
Now let's see how many misses are in the BOLTed binary:
```bash
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
  ...
   16,319,818,488,769      instructions
      244,888,677,972      L1-icache-misses
```
The number of misses per thousand instructions went down from 22 to 15, significantly reducing
the number of stalls in the CPU front-end.
Notice how the number of executed instructions stayed roughly the same. That's because we didn't
run any optimizations beyond the ones affecting the code layout. Other than instruction cache misses,
BOLT also reduces branch mispredictions, iTLB misses, and misses in L2 and L3.

## Using Clang for Other Applications

We collected the profile for Clang using its own source code. Is that enough to speed up
the compilation of other projects? We picked `mysqld`, an open-source database, to do the test.

On our 48-core Haswell system, the build finished in 136.06 seconds with the *PGO+LTO* Clang and in 126.10 seconds with the *PGO+LTO+BOLT* Clang.
That's a noticeable improvement, but not as significant as the one we saw on Clang itself.
This is partially because the number of instruction cache misses per thousand instructions is slightly lower in this scenario: 19 vs. 22.
Another reason is that Clang is run with a different set of options while building `mysqld` compared
to the training run.

Different options exercise different code paths, and
if we trained without a specific option, we may have misplaced parts of the code responsible for handling it.
To test this theory, we collected another `perf` profile while building `mysqld`, and merged it with the existing profile
using the `merge-fdata` utility that comes with BOLT. Optimized with that profile, the *PGO+LTO+BOLT* Clang was able
to perform the `mysqld` build in 124.74 seconds, i.e. 11 seconds or 9% faster compared to the *PGO+LTO* Clang.
The merged profile didn't make the original Clang compilation slower either, while the number of profiled functions in Clang increased from 11,415 to 14,025.
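
The figures quoted in the last two sections are simple ratios of `perf stat` counters and wall-clock times. As a minimal sketch for repeating the arithmetic on your own measurements, the two helpers below (`mpki` and `speedup`, both hypothetical names that are not part of BOLT or `perf`) compute misses per thousand instructions and the time saved between two builds. Since a percentage depends on the denominator, `speedup` reports the saving relative to both the old and the new time.

```bash
# mpki INSTRUCTIONS MISSES (hypothetical helper): cache misses per thousand
# instructions. Thousands separators, as printed by perf stat, are stripped first.
mpki() {
    awk -v instr="$1" -v misses="$2" 'BEGIN {
        gsub(/,/, "", instr)
        gsub(/,/, "", misses)
        printf "%.1f\n", misses / instr * 1000
    }'
}

# speedup OLD_SECONDS NEW_SECONDS (hypothetical helper): seconds saved, the
# saving as a share of the old time, and the saving as a share of the new time.
speedup() {
    awk -v t_old="$1" -v t_new="$2" 'BEGIN {
        saved = t_old - t_new
        printf "saved=%.2fs reduction=%.1f%% speedup=%.1f%%\n",
            saved, saved / t_old * 100, saved / t_new * 100
    }'
}
```

For example, `mpki 16,366,101,626,647 359,996,216,537` prints `22.0` for the *PGO+LTO* run above, and `speedup 136.06 124.74` reports the 11.32 seconds saved on the `mysqld` build.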

Ideally, the profile run has to be done with a superset of all commonly used options. However, the main improvement is expected with just the basic set.

## Summary

In this tutorial we demonstrated how to use BOLT to improve the
performance of the Clang compiler. Similarly, BOLT could be used to improve the performance
of GCC, or any other application suffering from a high number of instruction
cache misses.

----
# Appendix

## Bootstrapping Clang-7 with PGO and LTO

Below we describe detailed steps to build Clang, and make it ready for BOLT
optimizations. If you already have the build setup, you can skip this section,
except for the last step that adds the `-Wl,-q` linker flag to the final build.

### Getting Clang-7 Sources

Set `$TOPLEV` to the directory of your preference where you would like to do
builds, e.g. `TOPLEV=~/clang-7/`. Then use the following commands to clone the `release/7.x`
branch of the LLVM monorepo:
```bash
$ mkdir ${TOPLEV}
$ cd ${TOPLEV}
$ git clone --branch=release/7.x https://github.com/llvm/llvm-project.git
```

### Building Stage 1 Compiler

Stage 1 will be the first build we are going to do, and we will be using the
default system compiler to build Clang. If your system lacks a compiler, use
your distribution package manager to install one that supports C++11. In this
example we are going to use GCC. In addition to the compiler, you will need the
`cmake` and `ninja` packages. Note that we disable the build of certain
compiler-rt components that are known to cause build issues at release/7.x.
```bash
$ mkdir ${TOPLEV}/stage1
$ cd ${TOPLEV}/stage1
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_ASM_COMPILER=gcc \
    -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" \
    -DCOMPILER_RT_BUILD_SANITIZERS=OFF -DCOMPILER_RT_BUILD_XRAY=OFF \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage1/install
$ ninja install
```

### Building Stage 2 Compiler With Instrumentation

Using the freshly-baked stage 1 Clang compiler, we are going to build Clang with
profile generation capabilities:
```bash
$ mkdir ${TOPLEV}/stage2-prof-gen
$ cd ${TOPLEV}/stage2-prof-gen
$ CPATH=${TOPLEV}/stage1/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_USE_LINKER=lld -DLLVM_BUILD_INSTRUMENTED=ON \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-gen/install
$ ninja install
```

### Generating Profile for PGO

While there are many ways to obtain the profile data, we are going to use the
source code already at our disposal, i.e. we are going to collect the profile
while building Clang itself:
```bash
$ mkdir ${TOPLEV}/stage3-train
$ cd ${TOPLEV}/stage3-train
$ CPATH=${TOPLEV}/stage2-prof-gen/install/bin
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3-train/install
$ ninja clang
```
Once the build is completed, the profile files will be saved under
`${TOPLEV}/stage2-prof-gen/profiles`.
We will merge them before they can be
passed back into Clang:
```bash
$ cd ${TOPLEV}/stage2-prof-gen/profiles
$ ${TOPLEV}/stage1/install/bin/llvm-profdata merge -output=clang.profdata *
```

### Building Clang with PGO and LTO

Now the profile can be used to guide optimizations to produce better code for
our scenario, i.e. building Clang. We will also enable link-time optimizations
to allow cross-module inlining and other optimizations. Finally, we are going to
add one extra step that is useful for BOLT: a flag instructing the linker to
preserve relocations in the output binary. Note that this flag does not affect
the generated code or data used at runtime; it only writes metadata to the file
on disk:
```bash
$ mkdir ${TOPLEV}/stage2-prof-use-lto
$ cd ${TOPLEV}/stage2-prof-use-lto
$ CPATH=${TOPLEV}/stage1/install/bin/
$ export LDFLAGS="-Wl,-q"
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_LTO=Full \
    -DLLVM_PROFDATA_FILE=${TOPLEV}/stage2-prof-gen/profiles/clang.profdata \
    -DLLVM_USE_LINKER=lld \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-use-lto/install
$ ninja install
```
Now we have a Clang compiler that can build itself much faster. As we will see,
it builds other applications faster as well, and, with BOLT, the compile time
can be improved even further.
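
Before running BOLT, it may be worth confirming that the `-Wl,-q` flag took effect. The sketch below is one way to do that, under two assumptions that do not come from the steps above: `readelf` from binutils is installed, and the preserved relocations show up as a `.rela.text` section, which is typical for x86-64 ELF binaries. Both helper names are hypothetical.

```bash
# has_emit_relocs (hypothetical helper): read `readelf -S` output on stdin and
# succeed if a .rela.text section header is listed.
has_emit_relocs() {
    grep -q -F '.rela.text'
}

# check_binary FILE (hypothetical helper): report whether FILE appears to have
# been linked with relocations preserved.
check_binary() {
    if readelf -S --wide "$1" | has_emit_relocs; then
        echo "$1: relocations preserved"
    else
        echo "$1: no .rela.text section found; relink with -Wl,-q"
    fi
}
```

A typical invocation would be `check_binary ${TOPLEV}/stage2-prof-use-lto/install/bin/clang-7`.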