Support, Getting Involved, and FAQ
==================================

Please do not hesitate to reach out to us via openmp-dev@lists.llvm.org or join
one of our :ref:`regular calls <calls>`. Some common questions are answered in
the :ref:`faq`.

.. _calls:

Calls
-----

OpenMP in LLVM Technical Call
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Development updates on OpenMP (and OpenACC) in the LLVM Project, including Clang, optimization, and runtime work.
- Join the `OpenMP in LLVM Technical Call <https://bluejeans.com/544112769//webrtc>`__.
- Time: Weekly call every Wednesday at 7:00 AM Pacific time.
- Meeting minutes are `here <https://docs.google.com/document/d/1Tz8WFN13n7yJ-SCE0Qjqf9LmjGUw0dWO9Ts1ss4YOdg/edit>`__.
- Status tracking `page <https://openmp.llvm.org/docs>`__.


OpenMP in Flang Technical Call
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Development updates on OpenMP and OpenACC in the Flang Project.
- Join the `OpenMP in Flang Technical Call <https://bit.ly/39eQW3o>`_.
- Time: Weekly call every Thursday at 8:00 AM Pacific time.
- Meeting minutes are `here <https://docs.google.com/document/d/1yA-MeJf6RYY-ZXpdol0t7YoDoqtwAyBhFLr5thu5pFI>`__.
- Status tracking `page <https://docs.google.com/spreadsheets/d/1FvHPuSkGbl4mQZRAwCIndvQx9dQboffiD-xD0oqxgU0/edit#gid=0>`__.


.. _faq:

FAQ
---

.. note::
   The FAQ is a work in progress and most of the expected content is not yet
   available. While you can expect changes, we always welcome feedback and
   additions. Please contact us, e.g., through ``openmp-dev@lists.llvm.org``.


Q: How to contribute a patch to the webpage or any other part?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All patches go through the regular `LLVM review process
<https://llvm.org/docs/Contributing.html#how-to-submit-a-patch>`_.


.. _build_offload_capable_compiler:

Q: How to build an OpenMP GPU offload capable compiler?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To build an *effective* OpenMP offload capable compiler, only one extra CMake
option, `LLVM_ENABLE_RUNTIMES="openmp"`, is needed when building LLVM (generic
information about building LLVM is available `here
<https://llvm.org/docs/GettingStarted.html>`__). Make sure that all backends
targeted by OpenMP are enabled; by default, Clang is built with all backends
enabled. When building with `LLVM_ENABLE_RUNTIMES="openmp"`, OpenMP should not
be enabled in `LLVM_ENABLE_PROJECTS` as well, because it is enabled by default.

For Nvidia offload, please see :ref:`build_nvidia_offload_capable_compiler`.
For AMDGPU offload, please see :ref:`build_amdgpu_offload_capable_compiler`.

.. note::
   The compiler that generates the offload code should be the same (version) as
   the compiler that builds the OpenMP device runtimes. The OpenMP host runtime
   can be built by a different compiler.

.. _advanced_builds: https://llvm.org/docs/AdvancedBuilds.html

.. _build_nvidia_offload_capable_compiler:

Q: How to build an OpenMP NVidia offload capable compiler?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The CUDA SDK is required on the machine that will execute the OpenMP
application.

If your build machine is not the target machine or automatic detection of the
available GPUs failed, you should also set:

- `CLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_XX` where `XX` is the architecture of your GPU, e.g., 80.
- `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=YY` where `YY` is the numeric compute capability of your GPU, e.g., 75.
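
For illustration, a configure step that combines these options might look like
the following sketch. The relative source path, the install prefix, and the
``sm_70``/``70`` values are assumptions; adjust them to your checkout layout
and GPU.

.. code-block:: shell

   # Sketch of a configure step for an offload capable compiler, run from an
   # empty build directory next to an llvm-project checkout (assumed layout).
   cmake -G Ninja ../llvm-project/llvm \
     -DCMAKE_BUILD_TYPE=Release \
     -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
     -DLLVM_ENABLE_PROJECTS="clang" \
     -DLLVM_ENABLE_RUNTIMES="openmp" \
     -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 \
     -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70
   ninja install
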
.. _build_amdgpu_offload_capable_compiler:

Q: How to build an OpenMP AMDGPU offload capable compiler?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A subset of the `ROCm <https://github.com/radeonopencompute>`_ toolchain is
required to build the LLVM toolchain and to execute the OpenMP application.
Either install ROCm somewhere that CMake's ``find_package`` can locate it, or
build the required subcomponents ROCt and ROCr from source.

The two components used are ROCT-Thunk-Interface (ROCt) and ROCR-Runtime
(ROCr). ROCt is the userspace part of the Linux driver: it calls into the
driver which ships with the Linux kernel, and it is an implementation detail of
ROCr from OpenMP's perspective. ROCr is an implementation of `HSA
<http://www.hsafoundation.com>`_.

.. code-block:: text

   SOURCE_DIR=same-as-llvm-source # e.g. the checkout of llvm-project, next to openmp
   BUILD_DIR=somewhere
   INSTALL_PREFIX=same-as-llvm-install

   cd $SOURCE_DIR
   git clone git@github.com:RadeonOpenCompute/ROCT-Thunk-Interface.git -b roc-4.2.x \
     --single-branch
   git clone git@github.com:RadeonOpenCompute/ROCR-Runtime.git -b rocm-4.2.x \
     --single-branch

   cd $BUILD_DIR && mkdir roct && cd roct
   cmake $SOURCE_DIR/ROCT-Thunk-Interface/ -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
     -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
   make && make install

   cd $BUILD_DIR && mkdir rocr && cd rocr
   cmake $SOURCE_DIR/ROCR-Runtime/src -DIMAGE_SUPPORT=OFF \
     -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DCMAKE_BUILD_TYPE=Release \
     -DBUILD_SHARED_LIBS=ON
   make && make install

``IMAGE_SUPPORT`` requires building ROCr with Clang and is not used by OpenMP.

Provided CMake's ``find_package`` can find the ROCR-Runtime package, LLVM will
build a tool ``bin/amdgpu-arch`` which will print a string like ``gfx906`` when
run if it recognizes a GPU on the local system. LLVM will also build a shared
library, ``libomptarget.rtl.amdgpu.so``, which is linked against ROCr.

With those libraries installed, and LLVM built and installed, try:

.. code-block:: shell

   clang -O2 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa example.c -o example && ./example

Q: What are the known limitations of OpenMP AMDGPU offload?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``LD_LIBRARY_PATH`` or rpath/runpath entries are required to find ``libomp.so``
and ``libomptarget.so``.

There is no libc; that is, ``malloc`` and ``printf`` do not exist. Libm is
implemented in terms of the ROCm device library, which will be searched for
when linking with ``-lm``.

Some versions of the driver for the Radeon VII (gfx906) will error unless the
environment variable ``HSA_IGNORE_SRAMECC_MISREPORT=1`` is set.

OpenMP AMDGPU offload is a recent addition to LLVM, and the implementation
differs from the one that has been shipping in ROCm and AOMP for some time.
Early adopters will encounter bugs.
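
To make the first and third points concrete, running the ``example`` binary
from the previous answer might look like the following sketch. The install
layout is an assumption, and the SRAMECC workaround is only needed on affected
gfx906 driver versions.

.. code-block:: shell

   # Assumed layout: libomp.so and libomptarget.so live under $INSTALL_PREFIX/lib.
   export LD_LIBRARY_PATH=$INSTALL_PREFIX/lib:$LD_LIBRARY_PATH
   # Only needed for affected Radeon VII (gfx906) driver versions.
   export HSA_IGNORE_SRAMECC_MISREPORT=1
   ./example
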
Q: What are the LLVM components used in offloading and how are they found?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The libraries used by an executable compiled for target offloading are:

- ``libomp.so`` (or similar), the host OpenMP runtime
- ``libomptarget.so``, the target-agnostic target offloading OpenMP runtime
- plugins loaded by ``libomptarget.so``:

  - ``libomptarget.rtl.amdgpu.so``
  - ``libomptarget.rtl.cuda.so``
  - ``libomptarget.rtl.x86_64.so``
  - ``libomptarget.rtl.ve.so``
  - and others

- dependencies of those plugins, e.g. CUDA for nvptx and ROCr for amdgpu

The compiled executable is dynamically linked against a host runtime, e.g.
``libomp.so``, and against the target offloading runtime, ``libomptarget.so``.
These are found like any other dynamic library: by setting rpath or runpath on
the executable, by setting ``LD_LIBRARY_PATH``, or by adding them to the system
search path.

``libomptarget.so`` has rpath or runpath (whichever the system default is) set
to ``$ORIGIN``, and the plugins are located next to it, so it will find the
plugins without any environment variables set. If ``LD_LIBRARY_PATH`` is set,
whether it overrides which plugin is found depends on whether your system
treats ``-Wl,-rpath`` as RPATH or RUNPATH.

The plugins will try to find their dependencies in a plugin-dependent fashion.

The CUDA plugin is dynamically linked against libcuda if CMake found it at
compiler build time. Otherwise it will attempt to dlopen ``libcuda.so``. It
does not have rpath set.

The AMDGPU plugin is linked against ROCr if CMake found it at compiler build
time. Otherwise it will attempt to dlopen ``libhsa-runtime64.so``. It has rpath
set to ``$ORIGIN``, so installing ``libhsa-runtime64.so`` in the same directory
is a way to locate it without environment variables.

In addition to those, there is a compiler runtime library called deviceRTL.
This is compiled from mostly common code into an architecture-specific bitcode
library, e.g. ``libomptarget-nvptx-sm_70.bc``.

Clang and the deviceRTL need to match closely as the interface between them
changes frequently. Using both from the same monorepo checkout is strongly
recommended.

Unlike the host side, where environment variables select components, the
deviceRTL located in the Clang lib directory is preferred. Only if it is absent
is the ``LIBRARY_PATH`` environment variable searched for a bitcode file with
the right name. This can be overridden by passing a Clang flag,
``--libomptarget-nvptx-bc-path`` or ``--libomptarget-amdgcn-bc-path``, which can
specify a directory or an exact bitcode file to use.
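
For example, forcing a specific bitcode library for an NVPTX offload
compilation might look like the following sketch; the path after the flag is a
hypothetical placeholder.

.. code-block:: shell

   # The path is a hypothetical placeholder; it may name either a directory
   # containing the bitcode library or the exact file to use.
   clang -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
     --libomptarget-nvptx-bc-path=/path/to/libomptarget-nvptx-sm_70.bc \
     example.c -o example
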
Q: Does OpenMP offloading support work in pre-packaged LLVM releases?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For now, the answer is most likely *no*. Please see
:ref:`build_offload_capable_compiler`.

Q: Does OpenMP offloading support work in packages distributed as part of my OS?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For now, the answer is most likely *no*. Please see
:ref:`build_offload_capable_compiler`.


.. _math_and_complex_in_target_regions:

Q: Does Clang support `<math.h>` and `<complex.h>` operations in OpenMP target on GPUs?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yes, LLVM/Clang allows math functions and complex arithmetic inside OpenMP
target regions that are compiled for GPUs.

Clang provides a set of wrapper headers that are found first when `math.h` and
`complex.h`, for C, or `cmath` and `complex`, for C++, or similar headers are
included by the application. These wrappers eventually include the system
version of the corresponding header file after setting up a target device
specific environment. Including the system header is important because system
headers differ based on the architecture and operating system and may contain
preprocessor, variable, and function definitions that need to be available in
the target region regardless of the targeted device architecture. However,
various functions may require specialized device versions, e.g., `sin`, and
others are only available on certain devices, e.g., `__umul64hi`. To provide
"native" support for math and complex on the respective architecture, Clang
wraps the "native" math functions, e.g., as provided by the device vendor, in
an OpenMP begin/end declare variant. These functions are then picked up instead
of the host versions, while host-only variables and function definitions remain
available. Complex arithmetic and functions are supported through a similar
mechanism. It is worth noting that this support requires `extensions to the
OpenMP begin/end declare variant context selector
<https://clang.llvm.org/docs/AttributeReference.html#pragma-omp-declare-variant>`__
that are exposed through LLVM/Clang to the user as well.

Q: What is a way to debug errors from mapping memory to a target device?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An experimental way to debug these errors is to use :ref:`remote process
offloading <remote_offloading_plugin>`.
By using ``libomptarget.rtl.rpc.so`` and ``openmp-offloading-server``, it is
possible to explicitly perform memory transfers between processes on the host
CPU and run sanitizers while doing so in order to catch these errors.

Q: Why does my application say "Named symbol not found" and abort when I run it?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is most likely caused by trying to use OpenMP offloading with static
libraries. Static libraries do not contain any device code, so when the runtime
attempts to execute the target region it will not be found and you will get an
error like this:

.. code-block:: text

   CUDA error: Loading '__omp_offloading_fd02_3231c15__Z3foov_l2' Failed
   CUDA error: named symbol not found
   Libomptarget error: Unable to generate entries table for device id 0.

Currently, the only solution is to change how the application is built and
avoid the use of static libraries.

Q: Can I use dynamically linked libraries with OpenMP offloading?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Dynamically linked libraries can only be used if there is no device code split
between the library and the application. Anything declared on the device inside
the shared library will not be visible to the application when it is linked.
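
As an illustration of keeping all device code on one side of the boundary, a
build might look like the following sketch. The file names are hypothetical,
and how well a particular toolchain version handles shared libraries containing
offload regions may vary.

.. code-block:: shell

   # Hypothetical layout: every '#pragma omp target' region lives in compute.c,
   # which becomes the shared library; main.c only calls the library's host
   # entry points, so no device code is split across the library boundary.
   clang -O2 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -fPIC -shared \
     compute.c -o libcompute.so
   clang -O2 -fopenmp main.c -o app -L. -lcompute
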
Q: How to build an OpenMP offload capable compiler with an outdated host compiler?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Enabling the OpenMP runtime will perform a two-stage build for you.
If your host compiler is different from your system-wide compiler, you may need
to set the CMake variable `GCC_INSTALL_PREFIX` so clang will be able to find
the correct GCC toolchain in the second stage of the build.

For example, if your system-wide GCC installation is too old to build LLVM and
you would like to use a newer GCC, set the CMake variable `GCC_INSTALL_PREFIX`
to inform clang of the GCC installation you would like to use in the second
stage.

Q: How can I include OpenMP offloading support in my CMake project?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, there is an experimental CMake find module for OpenMP target
offloading provided by LLVM. It will attempt to find OpenMP target offloading
support for your compiler. If successful, the flags necessary for OpenMP target
offloading will be loaded into the ``OpenMPTarget::OpenMPTarget_<device>``
target or the ``OpenMPTarget_<device>_FLAGS`` variable. Currently supported
devices are ``AMDGPU`` and ``NVPTX``.

To use this module, simply add the path to the module to ``CMAKE_MODULE_PATH``
and call ``find_package``. The module will be installed with your OpenMP
installation by default. Including OpenMP offloading support in an application
should now only require a few additions.

.. code-block:: cmake

   cmake_minimum_required(VERSION 3.13.4)
   project(offloadTest VERSION 1.0 LANGUAGES CXX)

   list(APPEND CMAKE_MODULE_PATH "${PATH_TO_OPENMP_INSTALL}/lib/cmake/openmp")

   find_package(OpenMPTarget REQUIRED NVPTX)

   add_executable(offload)
   target_link_libraries(offload PRIVATE OpenMPTarget::OpenMPTarget_NVPTX)
   target_sources(offload PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src/Main.cpp)

Using this module requires at least CMake version 3.13.4. Supported languages
are C and C++, with Fortran support planned for the future. Compiler support is
best for Clang, but this module should also work for other compiler vendors
such as IBM and GNU.

Q: What does 'Stack size for entry function cannot be statically determined' mean?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a warning that the Nvidia tools will sometimes emit if the offloading
region is too complex. Normally, the CUDA tools attempt to statically determine
how much stack memory each thread needs, so that when the kernel is launched
each thread will have as much memory as it needs. If the control flow of the
kernel is too complex, containing recursive calls or nested parallelism, this
analysis can fail. If this warning is triggered it means that the kernel may
run out of stack memory during execution and crash. The environment variable
``LIBOMPTARGET_STACK_SIZE`` can be used to increase the stack size if this
occurs.
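
For illustration, the variable can be exported before launching the
application. The value below is an arbitrary assumption; tune it for your
kernel, and substitute your own executable name.

.. code-block:: shell

   # 8192 bytes per device thread is just an example value; adjust as needed.
   export LIBOMPTARGET_STACK_SIZE=8192
   ./offload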