1=============================
2Offloading Design & Internals
3=============================
4
5.. contents::
6   :local:
7
8Introduction
9============
10
11This document describes the Clang driver and code generation steps for creating
12offloading applications. Clang supports offloading to various architectures
13using programming models like CUDA, HIP, and OpenMP. The purpose of this
14document is to illustrate the steps necessary to create an offloading
15application using Clang.
16
17OpenMP Offloading
18=================
19
20.. note::
21   This documentation describes Clang's behavior using the new offloading
22   driver. This currently must be enabled manually using
23   ``-fopenmp-new-driver``.
24
25Clang supports OpenMP target offloading to several different architectures such
26as NVPTX, AMDGPU, X86_64, Arm, and PowerPC. Offloading code is generated by
27Clang and then executed using the ``libomptarget`` runtime and the associated
28plugin for the target architecture, e.g. ``libomptarget.rtl.cuda``. This section
29describes the steps necessary to create a functioning device image that can be
30loaded by the OpenMP runtime.  More information on the OpenMP runtimes can be
31found at the `OpenMP documentation page <https://openmp.llvm.org>`__.
32
33.. _Offloading Overview:
34
35Offloading Overview
36-------------------
37
38The goal of offloading compilation is to create an executable device image that
39can be run on the target device. OpenMP offloading creates executable images by
40compiling the input file for both the host and the target device. The output
41from the device phase then needs to be embedded into the host to create a fat
42object. A special tool then needs to extract the device code from the fat
43objects, run the device linking step, and embed the final image in a symbol the
44host runtime library can use to register the library and access the symbols on
45the device.
46
47Compilation Process
48^^^^^^^^^^^^^^^^^^^
49
50The compiler performs the following high-level actions to generate OpenMP
51offloading code:
52
53* Compile the input file for the host to produce a bitcode file. Lower ``#pragma
54  omp target`` declarations to :ref:`offloading entries <Generating Offloading
55  Entries>` and create metadata to indicate which entries are on the device.
56* Compile the input file for the target :ref:`device <Device Compilation>` using
57  the :ref:`offloading entry <Generating Offloading Entries>` metadata created
58  by the host.
59* Link the OpenMP device runtime library and run the backend to create a device
60  object file.
61* Run the backend on the host bitcode file and create a :ref:`fat object file
62  <Creating Fat Objects>` using the device object file.
63* Pass the fat object file to the :ref:`linker wrapper tool <Device Linking>`
64  and extract the device objects. Run the device linking action on the extracted
65  objects.
66* :ref:`Wrap <Device Binary Wrapping>` the :ref:`device images <Device linking>`
67  and :ref:`offload entries <Generating Offloading Entries>` in a symbol that
68  can be accessed by the host.
69* Add the :ref:`wrapped binary <Device Binary Wrapping>` to the linker input and
70  run the host linking action. Link with ``libomptarget`` to register and
71  execute the images.
72
73   .. _Generating Offloading Entries:
74
75Generating Offloading Entries
76-----------------------------
77
78The first step in compilation is to generate offloading entries for the host.
79This information is used to identify function kernels or global values that will
80be provided by the device. Blocks contained in a ``#pragma omp target`` or
81symbols inside a ``#pragma omp declare target`` directive will have offloading
82entries generated. The following table shows the :ref:`offload entry structure
83<table-tgt_offload_entry_structure>`.
84
85  .. table:: __tgt_offload_entry Structure
86    :name: table-tgt_offload_entry_structure
87
88    +---------+------------+------------------------------------------------------------------------+
89    |   Type  | Identifier | Description                                                            |
90    +=========+============+========================================================================+
91    |  void*  |    addr    | Address of global symbol within device image (function or global)      |
92    +---------+------------+------------------------------------------------------------------------+
93    |  char*  |    name    | Name of the symbol                                                     |
94    +---------+------------+------------------------------------------------------------------------+
95    |  size_t |    size    | Size of the entry info (0 if it is a function)                         |
96    +---------+------------+------------------------------------------------------------------------+
97    | int32_t |    flags   | Flags associated with the entry (see :ref:`table-offload_entry_flags`) |
98    +---------+------------+------------------------------------------------------------------------+
99    | int32_t |  reserved  | Reserved, to be used by the runtime library.                           |
100    +---------+------------+------------------------------------------------------------------------+
101
102The address of the global symbol will be set to the device pointer value by the
103runtime once the device image is loaded. The flags are set to indicate the
104handling required for the offloading entry. If the offloading entry is an entry
105to a target region it can have one of the following :ref:`entry flags
106<table-offload_entry_flags>`.
107
108  .. table:: Target Region Entry Flags
109    :name: table-offload_entry_flags
110
111    +----------------------------------+-------+-----------------------------------------+
112    |                Name              | Value | Description                             |
113    +==================================+=======+=========================================+
114    | OMPTargetRegionEntryTargetRegion | 0x00  | Mark the entry as generic target region |
115    +----------------------------------+-------+-----------------------------------------+
116    | OMPTargetRegionEntryCtor         | 0x02  | Mark the entry as a global constructor  |
117    +----------------------------------+-------+-----------------------------------------+
118    | OMPTargetRegionEntryDtor         | 0x04  | Mark the entry as a global destructor   |
119    +----------------------------------+-------+-----------------------------------------+
120
121If the offloading entry is a global variable, indicated by a non-zero size, it
122will instead have one of the following :ref:`global
123<table-offload_global_flags>` flags.
124
125  .. table:: Target Region Global
126    :name: table-offload_global_flags
127
128    +-----------------------------+-------+---------------------------------------------------------------+
129    |          Name               | Value | Description                                                   |
130    +=============================+=======+===============================================================+
131    | OMPTargetGlobalVarEntryTo   | 0x00  | Mark the entry as a 'to' attribute (w.r.t. the to clause)     |
132    +-----------------------------+-------+---------------------------------------------------------------+
133    | OMPTargetGlobalVarEntryLink | 0x01  | Mark the entry as a 'link' attribute (w.r.t. the link clause) |
134    +-----------------------------+-------+---------------------------------------------------------------+
135
136The target offload entries are used by the runtime to access the device kernels
137and globals that will be provided by the final device image. Each offloading
138entry is set to use the ``omp_offloading_entries`` section. When the final
139application is created the linker will provide the
140``__start_omp_offloading_entries`` and ``__stop_omp_offloading_entries`` symbols
141which are used to create the :ref:`final image <Device Binary Wrapping>`.
142
143This information is used by the device compilation stage to determine which
144symbols need to be exported from the device. We use the ``omp_offload.info``
145metadata node to pass this information device compilation stage.
146
147Accessing Entries on the Device
148^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
149
150Accessing the entries in the device is done using the address field in the
151:ref:`offload entry<table-tgt_offload_entry_structure>`. The runtime will set
152the address to the pointer associated with the device image during runtime
153initialization. This is used to call the corresponding kernel function when
154entering a ``#pragma omp target`` region. For variables, the runtime maintains a
155table mapping host pointers to device pointers. Global variables inside a
156``#pragma omp target declare`` directive are first initialized to the host's
157address. Once the device address is initialized we insert it into the table to
158map the host address to the device address.
159
160Debugging Information
161^^^^^^^^^^^^^^^^^^^^^
162
163We generate structures to hold debugging information that is passed to
164``libomptarget``. This allows the front-end to generate information the runtime
165library uses for more informative error messages. This is done using the
166standard :ref:`identifier structure <table-ident_t_structure>` used in
167``libomp`` and ``libomptarget``. This is used to pass information and source
168locations to the runtime.
169
170  .. table:: ident_t Structure
171    :name: table-ident_t_structure
172
173    +---------+------------+-----------------------------------------------------------------------------+
174    |   Type  | Identifier | Description                                                                 |
175    +=========+============+=============================================================================+
176    | int32_t |  reserved  | Reserved, to be used by the runtime library.                                |
177    +---------+------------+-----------------------------------------------------------------------------+
178    | int32_t |   flags    | Flags used to indicate some features, mostly unused.                        |
179    +---------+------------+-----------------------------------------------------------------------------+
180    | int32_t |  reserved  | Reserved, to be used by the runtime library.                                |
181    +---------+------------+-----------------------------------------------------------------------------+
182    | int32_t |  reserved  | Reserved, to be used by the runtime library.                                |
183    +---------+------------+-----------------------------------------------------------------------------+
184    |  char*  |  psource   | Program source information, stored as ";filename;function;line;column;;\\0" |
185    +---------+------------+-----------------------------------------------------------------------------+
186
187If debugging information is enabled, we will also create strings to indicate the
188names and declarations of variables mapped in target regions. These have the
189same format as the source location in the :ref:`identifier structure
190<table-ident_t_structure>`, but the function name is replaced with the variable
191name.
192
193.. _Device Compilation:
194
195Offload Device Compilation
196--------------------------
197
198The input file is compiled for each active device toolchain. The device
199compilation stage is performed differently from the host stage. Namely, we do
200not generate any offloading entries. This is set by passing the
201``-fopenmp-is-device`` flag to the front-end. We use the host bitcode to
202determine which symbols to export from the device. The bitcode file is passed in
203from the previous stage using the ``-fopenmp-host-ir-file-path`` flag.
204Compilation is otherwise performed as it would be for any other target triple.
205
206When compiling for the OpenMP device, we set the visibility of all device
207symbols to be ``protected`` by default. This improves performance and prevents a
208class of errors where a symbol in the target device could preempt a host
209library.
210
211The OpenMP runtime library is linked in during compilation to provide the
212implementations for standard OpenMP functionality. For GPU targets this is done
213by linking in a special bitcode library during compilation, (e.g.
214``libomptarget-nvptx64-sm_70.bc``) using the ``-mlink-builtin-bitcode`` flag.
215Other device libraries, such as CUDA's libdevice, are also linked this way. If
216the target is a standard architecture with an existing ``libomp``
217implementation, that will be linked instead. Finally, device tools are used to
218create a relocatable device object file that can be embedded in the host.
219
220.. _Creating Fat Objects:
221
222Creating Fat Objects
223--------------------
224
225A fat binary is a binary file that contains information intended for another
226device. We create a fat object by embedding the output of the device compilation
227stage into the host as a named section. The output from the device compilation
228is passed to the host backend using the ``-fembed-offload-object`` flag. This
229inserts the object as a global in the host's IR. The section name contains the
230target triple and architecture that the data corresponds to for later use.
231Typically we will also add an extra string to the section name to prevent it
232from being merged with other sections if the user performs relocatable linking
233on the object.
234
235.. code-block:: llvm
236
237  @llvm.embedded.object = private constant [1 x i8] c"\00", section ".llvm.offloading.nvptx64.sm_70."
238
239The device code will then be placed in the corresponding section one the backend
240is run on the host, creating a fat object. Using fat objects allows us to treat
241offloading objects as standard host objects. The final object file should
242contain the following :ref:`offloading sections <table-offloading_sections>`. We
243will use this information when :ref:`Device Linking`.
244
245  .. table:: Offloading Sections
246    :name: table-offloading_sections
247
248    +----------------------------------+--------------------------------------------------------------------+
249    |             Section              | Description                                                        |
250    +==================================+====================================================================+
251    | omp_offloading_entries           | Offloading entry information (see :ref:`table-tgt_offload_entry`)  |
252    +----------------------------------+--------------------------------------------------------------------+
253    | .llvm.offloading.<triple>.<arch> | Embedded device object file for the target device and architecture |
254    +----------------------------------+--------------------------------------------------------------------+
255
256.. _Device Linking:
257
258Linking Target Device Code
259--------------------------
260
261Objects containing :ref:`table-offloading_sections` require special handling to
262create an executable device image. This is done using a Clang tool, see
263:doc:`ClangLinkerWrapper` for more information. This tool works as a wrapper
264over the host linking job. It scans the input object files for the offloading
265sections and runs the appropriate device linking action. The linked device image
266is then :ref:`wrapped <Device Binary Wrapping>` to create the symbols used to load the
267device image and link it with the host.
268
269The linker wrapper tool supports linking bitcode files through link time
270optimization (LTO). This is used whenever the object files embedded in the host
271contain LLVM bitcode. Bitcode will be embedded for architectures that do not
272support a relocatable object format, such as AMDGPU or SPIR-V, or if the user
273requested it using the ``-foffload-lto`` flag.
274
275.. _Device Binary Wrapping:
276
277Device Binary Wrapping
278----------------------
279
280Various structures and functions are used to create the information necessary to
281offload code on the device. We use the :ref:`linked device executable <Device
282Linking>` with the corresponding offloading entries to create the symbols
283necessary to load and execute the device image.
284
285Structure Types
286^^^^^^^^^^^^^^^
287
288Several different structures are used to store offloading information. The
289:ref:`device image structure <table-device_image_structure>` stores a single
290linked device image and its associated offloading entries. The offloading
291entries are stored using the ``__start_omp_offloading_entries`` and
292``__stop_omp_offloading_entries`` symbols generated by the linker using the
293:ref:`table-tgt_offload_entry`.
294
295  .. table:: __tgt_device_image Structure
296    :name: table-device_image_structure
297
298    +----------------------+--------------+----------------------------------------+
299    |         Type         |  Identifier  | Description                            |
300    +======================+==============+========================================+
301    |         void*        |  ImageStart  | Pointer to the target code start       |
302    +----------------------+--------------+----------------------------------------+
303    |         void*        |   ImageEnd   | Pointer to the target code end         |
304    +----------------------+--------------+----------------------------------------+
305    | __tgt_offload_entry* | EntriesBegin | Begin of table with all target entries |
306    +----------------------+--------------+----------------------------------------+
307    | __tgt_offload_entry* |  EntriesEnd  | End of table (non inclusive)           |
308    +----------------------+--------------+----------------------------------------+
309
310The target :ref:`target binary descriptor <table-target_binary_descriptor>` is
311used to store all binary images and offloading entries in an array.
312
313  .. table:: __tgt_bin_desc Structure
314    :name: table-target_binary_descriptor
315
316    +----------------------+------------------+------------------------------------------+
317    |         Type         |    Identifier    | Description                              |
318    +======================+==================+==========================================+
319    |        int32_t       |  NumDeviceImages | Number of device types supported         |
320    +----------------------+------------------+------------------------------------------+
321    |  __tgt_device_image* |   DeviceImages   | Array of device images (1 per dev. type) |
322    +----------------------+------------------+------------------------------------------+
323    | __tgt_offload_entry* | HostEntriesBegin | Begin of table with all host entries     |
324    +----------------------+------------------+------------------------------------------+
325    | __tgt_offload_entry* |  HostEntriesEnd  | End of table (non inclusive)             |
326    +----------------------+------------------+------------------------------------------+
327
328Global Variables
329----------------
330
331:ref:`table-global_variables` lists various global variables, along with their
332type and their explicit ELF sections, which are used to store device images and
333related symbols.
334
335  .. table:: Global Variables
336    :name: table-global_variables
337
338    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
339    |            Variable            |         Type        |       ELF Section       |                    Description                          |
340    +================================+=====================+=========================+=========================================================+
341    | __start_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | Begin symbol for the offload entries table.             |
342    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
343    | __stop_omp_offloading_entries  | __tgt_offload_entry | .omp_offloading_entries | End symbol for the offload entries table.               |
344    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
345    | __dummy.omp_offloading.entry   | __tgt_offload_entry | .omp_offloading_entries | Dummy zero-sized object in the offload entries          |
346    |                                |                     |                         | section to force linker to define begin/end             |
347    |                                |                     |                         | symbols defined above.                                  |
348    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
349    | .omp_offloading.device_image   |  __tgt_device_image | .omp_offloading_entries | ELF device code object of the first image.              |
350    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
351    | .omp_offloading.device_image.N |  __tgt_device_image | .omp_offloading_entries | ELF device code object of the (N+1)th image.            |
352    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
353    | .omp_offloading.device_images  |  __tgt_device_image | .omp_offloading_entries | Array of images.                                        |
354    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
355    | .omp_offloading.descriptor     | __tgt_bin_desc      | .omp_offloading_entries | Binary descriptor object (see :ref:`binary_descriptor`) |
356    +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+
357
358.. _binary_descriptor:
359
360Binary Descriptor for Device Images
361^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
362
363This object is passed to the offloading runtime at program startup and it
364describes all device images available in the executable or shared library. It
365is defined as follows:
366
367.. code-block:: c
368
369  __attribute__((visibility("hidden")))
370  extern __tgt_offload_entry *__start_omp_offloading_entries;
371  __attribute__((visibility("hidden")))
372  extern __tgt_offload_entry *__stop_omp_offloading_entries;
373  static const char Image0[] = { <Bufs.front() contents> };
374  ...
375  static const char ImageN[] = { <Bufs.back() contents> };
376  static const __tgt_device_image Images[] = {
377    {
378      Image0,                            /*ImageStart*/
379      Image0 + sizeof(Image0),           /*ImageEnd*/
380      __start_omp_offloading_entries,    /*EntriesBegin*/
381      __stop_omp_offloading_entries      /*EntriesEnd*/
382    },
383    ...
384    {
385      ImageN,                            /*ImageStart*/
386      ImageN + sizeof(ImageN),           /*ImageEnd*/
387      __start_omp_offloading_entries,    /*EntriesBegin*/
388      __stop_omp_offloading_entries      /*EntriesEnd*/
389    }
390  };
391  static const __tgt_bin_desc BinDesc = {
392    sizeof(Images) / sizeof(Images[0]),  /*NumDeviceImages*/
393    Images,                              /*DeviceImages*/
394    __start_omp_offloading_entries,      /*HostEntriesBegin*/
395    __stop_omp_offloading_entries        /*HostEntriesEnd*/
396  };
397
398
399Global Constructor and Destructor
400^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
401
402The global constructor (``.omp_offloading.descriptor_reg()``) registers the
403device images with the runtime by calling the ``__tgt_register_lib()`` runtime
404function. The constructor is explicitly defined in ``.text.startup`` section and
405is run once when the program starts. Similarly, the global destructor
406(``.omp_offloading.descriptor_unreg()``) calls ``__tgt_unregister_lib()`` for
407the destructor and is also defined in ``.text.startup`` section and run when the
408program exits.
409
410Offloading Example
411------------------
412
413This section contains a simple example of generating offloading code using
414OpenMP offloading. We will use a simple ``ZAXPY`` BLAS routine.
415
416.. code-block:: c++
417
418    #include <complex>
419
420    using complex = std::complex<double>;
421
422    void zaxpy(complex *X, complex *Y, complex D, std::size_t N) {
423    #pragma omp target teams distribute parallel for
424      for (std::size_t i = 0; i < N; ++i)
425        Y[i] = D * X[i] + Y[i];
426    }
427
428    int main() {
429      const std::size_t N = 1024;
430      complex X[N], Y[N], D;
431    #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N])
432      zaxpy(X, Y, D, N);
433    }
434
435This code is compiled using the following Clang flags.
436
437.. code-block:: console
438
439    $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 zaxpy.cpp -c
440
441The output section in the object file can be seen using the ``readelf`` utility
442
443.. code-block:: text
444
445  $ llvm-readelf -WS zaxpy.o
446  [Nr] Name                                       Type
447  ...
448  [34] omp_offloading_entries                     PROGBITS
449  [35] .llvm.offloading.nvptx64-nvidia-cuda.sm_70 PROGBITS
450
451Compiling this file again will invoke the ``clang-linker-wrapper`` utility to
452extract and link the device code stored at the section named
453``.llvm.offloading.nvptx64-nvidia-cuda.sm_70`` and then use entries stored in
454the section named ``omp_offloading_entries`` to create the symbols necessary for
455``libomptarget`` to register the device image and call the entry function.
456
457.. code-block:: console
458
459    $ clang++ -fopenmp -fopenmp-targets=nvptx64 zaxpy.o -o zaxpy
460    $ ./zaxpy
461
462We can see the steps created by clang to generate the offloading code using the
463``-ccc-print-phases`` option in Clang. This matches the description in
464:ref:`Offloading Overview`.
465
466.. code-block:: console
467
468    $ clang++ -fopenmp -fopenmp-targets=nvptx64 -ccc-print-phases zaxpy.cpp
469    # "x86_64-unknown-linux-gnu" - "clang", inputs: ["zaxpy.cpp"], output: "/tmp/zaxpy-host.bc"
470    # "nvptx64-nvidia-cuda" - "clang", inputs: ["zaxpy.cpp", "/tmp/zaxpy-e6a41b.bc"], output: "/tmp/zaxpy-07f434.s"
471    # "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/tmp/zaxpy-07f434.s"], output: "/tmp/zaxpy-0af7b7.o"
472    # "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/zaxpy-e6a41b.bc", "/tmp/zaxpy-0af7b7.o"], output: "/tmp/zaxpy-416cad.o"
473    # "x86_64-unknown-linux-gnu" - "Offload::Linker", inputs: ["/tmp/zaxpy-416cad.o"], output: "a.out"
474