1============================= 2Offloading Design & Internals 3============================= 4 5.. contents:: 6 :local: 7 8Introduction 9============ 10 11This document describes the Clang driver and code generation steps for creating 12offloading applications. Clang supports offloading to various architectures 13using programming models like CUDA, HIP, and OpenMP. The purpose of this 14document is to illustrate the steps necessary to create an offloading 15application using Clang. 16 17OpenMP Offloading 18================= 19 20.. note:: 21 This documentation describes Clang's behavior using the new offloading 22 driver. This currently must be enabled manually using 23 ``-fopenmp-new-driver``. 24 25Clang supports OpenMP target offloading to several different architectures such 26as NVPTX, AMDGPU, X86_64, Arm, and PowerPC. Offloading code is generated by 27Clang and then executed using the ``libomptarget`` runtime and the associated 28plugin for the target architecture, e.g. ``libomptarget.rtl.cuda``. This section 29describes the steps necessary to create a functioning device image that can be 30loaded by the OpenMP runtime. More information on the OpenMP runtimes can be 31found at the `OpenMP documentation page <https://openmp.llvm.org>`__. 32 33.. _Offloading Overview: 34 35Offloading Overview 36------------------- 37 38The goal of offloading compilation is to create an executable device image that 39can be run on the target device. OpenMP offloading creates executable images by 40compiling the input file for both the host and the target device. The output 41from the device phase then needs to be embedded into the host to create a fat 42object. A special tool then needs to extract the device code from the fat 43objects, run the device linking step, and embed the final image in a symbol the 44host runtime library can use to register the library and access the symbols on 45the device. 46 47Compilation Process 48^^^^^^^^^^^^^^^^^^^ 49 50The compiler performs the following high-level actions to generate OpenMP 51offloading code: 52 53* Compile the input file for the host to produce a bitcode file. Lower ``#pragma 54 omp target`` declarations to :ref:`offloading entries <Generating Offloading 55 Entries>` and create metadata to indicate which entries are on the device. 56* Compile the input file for the target :ref:`device <Device Compilation>` using 57 the :ref:`offloading entry <Generating Offloading Entries>` metadata created 58 by the host. 59* Link the OpenMP device runtime library and run the backend to create a device 60 object file. 61* Run the backend on the host bitcode file and create a :ref:`fat object file 62 <Creating Fat Objects>` using the device object file. 63* Pass the fat object file to the :ref:`linker wrapper tool <Device Linking>` 64 and extract the device objects. Run the device linking action on the extracted 65 objects. 66* :ref:`Wrap <Device Binary Wrapping>` the :ref:`device images <Device linking>` 67 and :ref:`offload entries <Generating Offloading Entries>` in a symbol that 68 can be accessed by the host. 69* Add the :ref:`wrapped binary <Device Binary Wrapping>` to the linker input and 70 run the host linking action. Link with ``libomptarget`` to register and 71 execute the images. 72 73 .. _Generating Offloading Entries: 74 75Generating Offloading Entries 76----------------------------- 77 78The first step in compilation is to generate offloading entries for the host. 79This information is used to identify function kernels or global values that will 80be provided by the device. Blocks contained in a ``#pragma omp target`` or 81symbols inside a ``#pragma omp declare target`` directive will have offloading 82entries generated. The following table shows the :ref:`offload entry structure 83<table-tgt_offload_entry_structure>`. 84 85 .. table:: __tgt_offload_entry Structure 86 :name: table-tgt_offload_entry_structure 87 88 +---------+------------+------------------------------------------------------------------------+ 89 | Type | Identifier | Description | 90 +=========+============+========================================================================+ 91 | void* | addr | Address of global symbol within device image (function or global) | 92 +---------+------------+------------------------------------------------------------------------+ 93 | char* | name | Name of the symbol | 94 +---------+------------+------------------------------------------------------------------------+ 95 | size_t | size | Size of the entry info (0 if it is a function) | 96 +---------+------------+------------------------------------------------------------------------+ 97 | int32_t | flags | Flags associated with the entry (see :ref:`table-offload_entry_flags`) | 98 +---------+------------+------------------------------------------------------------------------+ 99 | int32_t | reserved | Reserved, to be used by the runtime library. | 100 +---------+------------+------------------------------------------------------------------------+ 101 102The address of the global symbol will be set to the device pointer value by the 103runtime once the device image is loaded. The flags are set to indicate the 104handling required for the offloading entry. If the offloading entry is an entry 105to a target region it can have one of the following :ref:`entry flags 106<table-offload_entry_flags>`. 107 108 .. table:: Target Region Entry Flags 109 :name: table-offload_entry_flags 110 111 +----------------------------------+-------+-----------------------------------------+ 112 | Name | Value | Description | 113 +==================================+=======+=========================================+ 114 | OMPTargetRegionEntryTargetRegion | 0x00 | Mark the entry as generic target region | 115 +----------------------------------+-------+-----------------------------------------+ 116 | OMPTargetRegionEntryCtor | 0x02 | Mark the entry as a global constructor | 117 +----------------------------------+-------+-----------------------------------------+ 118 | OMPTargetRegionEntryDtor | 0x04 | Mark the entry as a global destructor | 119 +----------------------------------+-------+-----------------------------------------+ 120 121If the offloading entry is a global variable, indicated by a non-zero size, it 122will instead have one of the following :ref:`global 123<table-offload_global_flags>` flags. 124 125 .. table:: Target Region Global 126 :name: table-offload_global_flags 127 128 +-----------------------------+-------+---------------------------------------------------------------+ 129 | Name | Value | Description | 130 +=============================+=======+===============================================================+ 131 | OMPTargetGlobalVarEntryTo | 0x00 | Mark the entry as a 'to' attribute (w.r.t. the to clause) | 132 +-----------------------------+-------+---------------------------------------------------------------+ 133 | OMPTargetGlobalVarEntryLink | 0x01 | Mark the entry as a 'link' attribute (w.r.t. the link clause) | 134 +-----------------------------+-------+---------------------------------------------------------------+ 135 136The target offload entries are used by the runtime to access the device kernels 137and globals that will be provided by the final device image. Each offloading 138entry is set to use the ``omp_offloading_entries`` section. When the final 139application is created the linker will provide the 140``__start_omp_offloading_entries`` and ``__stop_omp_offloading_entries`` symbols 141which are used to create the :ref:`final image <Device Binary Wrapping>`. 142 143This information is used by the device compilation stage to determine which 144symbols need to be exported from the device. We use the ``omp_offload.info`` 145metadata node to pass this information device compilation stage. 146 147Accessing Entries on the Device 148^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 149 150Accessing the entries in the device is done using the address field in the 151:ref:`offload entry<table-tgt_offload_entry_structure>`. The runtime will set 152the address to the pointer associated with the device image during runtime 153initialization. This is used to call the corresponding kernel function when 154entering a ``#pragma omp target`` region. For variables, the runtime maintains a 155table mapping host pointers to device pointers. Global variables inside a 156``#pragma omp target declare`` directive are first initialized to the host's 157address. Once the device address is initialized we insert it into the table to 158map the host address to the device address. 159 160Debugging Information 161^^^^^^^^^^^^^^^^^^^^^ 162 163We generate structures to hold debugging information that is passed to 164``libomptarget``. This allows the front-end to generate information the runtime 165library uses for more informative error messages. This is done using the 166standard :ref:`identifier structure <table-ident_t_structure>` used in 167``libomp`` and ``libomptarget``. This is used to pass information and source 168locations to the runtime. 169 170 .. table:: ident_t Structure 171 :name: table-ident_t_structure 172 173 +---------+------------+-----------------------------------------------------------------------------+ 174 | Type | Identifier | Description | 175 +=========+============+=============================================================================+ 176 | int32_t | reserved | Reserved, to be used by the runtime library. | 177 +---------+------------+-----------------------------------------------------------------------------+ 178 | int32_t | flags | Flags used to indicate some features, mostly unused. | 179 +---------+------------+-----------------------------------------------------------------------------+ 180 | int32_t | reserved | Reserved, to be used by the runtime library. | 181 +---------+------------+-----------------------------------------------------------------------------+ 182 | int32_t | reserved | Reserved, to be used by the runtime library. | 183 +---------+------------+-----------------------------------------------------------------------------+ 184 | char* | psource | Program source information, stored as ";filename;function;line;column;;\\0" | 185 +---------+------------+-----------------------------------------------------------------------------+ 186 187If debugging information is enabled, we will also create strings to indicate the 188names and declarations of variables mapped in target regions. These have the 189same format as the source location in the :ref:`identifier structure 190<table-ident_t_structure>`, but the function name is replaced with the variable 191name. 192 193.. _Device Compilation: 194 195Offload Device Compilation 196-------------------------- 197 198The input file is compiled for each active device toolchain. The device 199compilation stage is performed differently from the host stage. Namely, we do 200not generate any offloading entries. This is set by passing the 201``-fopenmp-is-device`` flag to the front-end. We use the host bitcode to 202determine which symbols to export from the device. The bitcode file is passed in 203from the previous stage using the ``-fopenmp-host-ir-file-path`` flag. 204Compilation is otherwise performed as it would be for any other target triple. 205 206When compiling for the OpenMP device, we set the visibility of all device 207symbols to be ``protected`` by default. This improves performance and prevents a 208class of errors where a symbol in the target device could preempt a host 209library. 210 211The OpenMP runtime library is linked in during compilation to provide the 212implementations for standard OpenMP functionality. For GPU targets this is done 213by linking in a special bitcode library during compilation, (e.g. 214``libomptarget-nvptx64-sm_70.bc``) using the ``-mlink-builtin-bitcode`` flag. 215Other device libraries, such as CUDA's libdevice, are also linked this way. If 216the target is a standard architecture with an existing ``libomp`` 217implementation, that will be linked instead. Finally, device tools are used to 218create a relocatable device object file that can be embedded in the host. 219 220.. _Creating Fat Objects: 221 222Creating Fat Objects 223-------------------- 224 225A fat binary is a binary file that contains information intended for another 226device. We create a fat object by embedding the output of the device compilation 227stage into the host as a named section. The output from the device compilation 228is passed to the host backend using the ``-fembed-offload-object`` flag. This 229inserts the object as a global in the host's IR. The section name contains the 230target triple and architecture that the data corresponds to for later use. 231Typically we will also add an extra string to the section name to prevent it 232from being merged with other sections if the user performs relocatable linking 233on the object. 234 235.. code-block:: llvm 236 237 @llvm.embedded.object = private constant [1 x i8] c"\00", section ".llvm.offloading.nvptx64.sm_70." 238 239The device code will then be placed in the corresponding section one the backend 240is run on the host, creating a fat object. Using fat objects allows us to treat 241offloading objects as standard host objects. The final object file should 242contain the following :ref:`offloading sections <table-offloading_sections>`. We 243will use this information when :ref:`Device Linking`. 244 245 .. table:: Offloading Sections 246 :name: table-offloading_sections 247 248 +----------------------------------+--------------------------------------------------------------------+ 249 | Section | Description | 250 +==================================+====================================================================+ 251 | omp_offloading_entries | Offloading entry information (see :ref:`table-tgt_offload_entry`) | 252 +----------------------------------+--------------------------------------------------------------------+ 253 | .llvm.offloading.<triple>.<arch> | Embedded device object file for the target device and architecture | 254 +----------------------------------+--------------------------------------------------------------------+ 255 256.. _Device Linking: 257 258Linking Target Device Code 259-------------------------- 260 261Objects containing :ref:`table-offloading_sections` require special handling to 262create an executable device image. This is done using a Clang tool, see 263:doc:`ClangLinkerWrapper` for more information. This tool works as a wrapper 264over the host linking job. It scans the input object files for the offloading 265sections and runs the appropriate device linking action. The linked device image 266is then :ref:`wrapped <Device Binary Wrapping>` to create the symbols used to load the 267device image and link it with the host. 268 269The linker wrapper tool supports linking bitcode files through link time 270optimization (LTO). This is used whenever the object files embedded in the host 271contain LLVM bitcode. Bitcode will be embedded for architectures that do not 272support a relocatable object format, such as AMDGPU or SPIR-V, or if the user 273requested it using the ``-foffload-lto`` flag. 274 275.. _Device Binary Wrapping: 276 277Device Binary Wrapping 278---------------------- 279 280Various structures and functions are used to create the information necessary to 281offload code on the device. We use the :ref:`linked device executable <Device 282Linking>` with the corresponding offloading entries to create the symbols 283necessary to load and execute the device image. 284 285Structure Types 286^^^^^^^^^^^^^^^ 287 288Several different structures are used to store offloading information. The 289:ref:`device image structure <table-device_image_structure>` stores a single 290linked device image and its associated offloading entries. The offloading 291entries are stored using the ``__start_omp_offloading_entries`` and 292``__stop_omp_offloading_entries`` symbols generated by the linker using the 293:ref:`table-tgt_offload_entry`. 294 295 .. table:: __tgt_device_image Structure 296 :name: table-device_image_structure 297 298 +----------------------+--------------+----------------------------------------+ 299 | Type | Identifier | Description | 300 +======================+==============+========================================+ 301 | void* | ImageStart | Pointer to the target code start | 302 +----------------------+--------------+----------------------------------------+ 303 | void* | ImageEnd | Pointer to the target code end | 304 +----------------------+--------------+----------------------------------------+ 305 | __tgt_offload_entry* | EntriesBegin | Begin of table with all target entries | 306 +----------------------+--------------+----------------------------------------+ 307 | __tgt_offload_entry* | EntriesEnd | End of table (non inclusive) | 308 +----------------------+--------------+----------------------------------------+ 309 310The target :ref:`target binary descriptor <table-target_binary_descriptor>` is 311used to store all binary images and offloading entries in an array. 312 313 .. table:: __tgt_bin_desc Structure 314 :name: table-target_binary_descriptor 315 316 +----------------------+------------------+------------------------------------------+ 317 | Type | Identifier | Description | 318 +======================+==================+==========================================+ 319 | int32_t | NumDeviceImages | Number of device types supported | 320 +----------------------+------------------+------------------------------------------+ 321 | __tgt_device_image* | DeviceImages | Array of device images (1 per dev. type) | 322 +----------------------+------------------+------------------------------------------+ 323 | __tgt_offload_entry* | HostEntriesBegin | Begin of table with all host entries | 324 +----------------------+------------------+------------------------------------------+ 325 | __tgt_offload_entry* | HostEntriesEnd | End of table (non inclusive) | 326 +----------------------+------------------+------------------------------------------+ 327 328Global Variables 329---------------- 330 331:ref:`table-global_variables` lists various global variables, along with their 332type and their explicit ELF sections, which are used to store device images and 333related symbols. 334 335 .. table:: Global Variables 336 :name: table-global_variables 337 338 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 339 | Variable | Type | ELF Section | Description | 340 +================================+=====================+=========================+=========================================================+ 341 | __start_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | Begin symbol for the offload entries table. | 342 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 343 | __stop_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | End symbol for the offload entries table. | 344 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 345 | __dummy.omp_offloading.entry | __tgt_offload_entry | .omp_offloading_entries | Dummy zero-sized object in the offload entries | 346 | | | | section to force linker to define begin/end | 347 | | | | symbols defined above. | 348 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 349 | .omp_offloading.device_image | __tgt_device_image | .omp_offloading_entries | ELF device code object of the first image. | 350 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 351 | .omp_offloading.device_image.N | __tgt_device_image | .omp_offloading_entries | ELF device code object of the (N+1)th image. | 352 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 353 | .omp_offloading.device_images | __tgt_device_image | .omp_offloading_entries | Array of images. | 354 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 355 | .omp_offloading.descriptor | __tgt_bin_desc | .omp_offloading_entries | Binary descriptor object (see :ref:`binary_descriptor`) | 356 +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ 357 358.. _binary_descriptor: 359 360Binary Descriptor for Device Images 361^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 362 363This object is passed to the offloading runtime at program startup and it 364describes all device images available in the executable or shared library. It 365is defined as follows: 366 367.. code-block:: c 368 369 __attribute__((visibility("hidden"))) 370 extern __tgt_offload_entry *__start_omp_offloading_entries; 371 __attribute__((visibility("hidden"))) 372 extern __tgt_offload_entry *__stop_omp_offloading_entries; 373 static const char Image0[] = { <Bufs.front() contents> }; 374 ... 375 static const char ImageN[] = { <Bufs.back() contents> }; 376 static const __tgt_device_image Images[] = { 377 { 378 Image0, /*ImageStart*/ 379 Image0 + sizeof(Image0), /*ImageEnd*/ 380 __start_omp_offloading_entries, /*EntriesBegin*/ 381 __stop_omp_offloading_entries /*EntriesEnd*/ 382 }, 383 ... 384 { 385 ImageN, /*ImageStart*/ 386 ImageN + sizeof(ImageN), /*ImageEnd*/ 387 __start_omp_offloading_entries, /*EntriesBegin*/ 388 __stop_omp_offloading_entries /*EntriesEnd*/ 389 } 390 }; 391 static const __tgt_bin_desc BinDesc = { 392 sizeof(Images) / sizeof(Images[0]), /*NumDeviceImages*/ 393 Images, /*DeviceImages*/ 394 __start_omp_offloading_entries, /*HostEntriesBegin*/ 395 __stop_omp_offloading_entries /*HostEntriesEnd*/ 396 }; 397 398 399Global Constructor and Destructor 400^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 401 402The global constructor (``.omp_offloading.descriptor_reg()``) registers the 403device images with the runtime by calling the ``__tgt_register_lib()`` runtime 404function. The constructor is explicitly defined in ``.text.startup`` section and 405is run once when the program starts. Similarly, the global destructor 406(``.omp_offloading.descriptor_unreg()``) calls ``__tgt_unregister_lib()`` for 407the destructor and is also defined in ``.text.startup`` section and run when the 408program exits. 409 410Offloading Example 411------------------ 412 413This section contains a simple example of generating offloading code using 414OpenMP offloading. We will use a simple ``ZAXPY`` BLAS routine. 415 416.. code-block:: c++ 417 418 #include <complex> 419 420 using complex = std::complex<double>; 421 422 void zaxpy(complex *X, complex *Y, complex D, std::size_t N) { 423 #pragma omp target teams distribute parallel for 424 for (std::size_t i = 0; i < N; ++i) 425 Y[i] = D * X[i] + Y[i]; 426 } 427 428 int main() { 429 const std::size_t N = 1024; 430 complex X[N], Y[N], D; 431 #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N]) 432 zaxpy(X, Y, D, N); 433 } 434 435This code is compiled using the following Clang flags. 436 437.. code-block:: console 438 439 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 zaxpy.cpp -c 440 441The output section in the object file can be seen using the ``readelf`` utility 442 443.. code-block:: text 444 445 $ llvm-readelf -WS zaxpy.o 446 [Nr] Name Type 447 ... 448 [34] omp_offloading_entries PROGBITS 449 [35] .llvm.offloading.nvptx64-nvidia-cuda.sm_70 PROGBITS 450 451Compiling this file again will invoke the ``clang-linker-wrapper`` utility to 452extract and link the device code stored at the section named 453``.llvm.offloading.nvptx64-nvidia-cuda.sm_70`` and then use entries stored in 454the section named ``omp_offloading_entries`` to create the symbols necessary for 455``libomptarget`` to register the device image and call the entry function. 456 457.. code-block:: console 458 459 $ clang++ -fopenmp -fopenmp-targets=nvptx64 zaxpy.o -o zaxpy 460 $ ./zaxpy 461 462We can see the steps created by clang to generate the offloading code using the 463``-ccc-print-phases`` option in Clang. This matches the description in 464:ref:`Offloading Overview`. 465 466.. code-block:: console 467 468 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -ccc-print-phases zaxpy.cpp 469 # "x86_64-unknown-linux-gnu" - "clang", inputs: ["zaxpy.cpp"], output: "/tmp/zaxpy-host.bc" 470 # "nvptx64-nvidia-cuda" - "clang", inputs: ["zaxpy.cpp", "/tmp/zaxpy-e6a41b.bc"], output: "/tmp/zaxpy-07f434.s" 471 # "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/tmp/zaxpy-07f434.s"], output: "/tmp/zaxpy-0af7b7.o" 472 # "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/zaxpy-e6a41b.bc", "/tmp/zaxpy-0af7b7.o"], output: "/tmp/zaxpy-416cad.o" 473 # "x86_64-unknown-linux-gnu" - "Offload::Linker", inputs: ["/tmp/zaxpy-416cad.o"], output: "a.out" 474