1=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6   :local:
7
8Introduction
9============
10
11The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
14
15LLVM
16====
17
18.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26  .. table:: AMDGPU Architectures
27     :name: amdgpu-architecture-table
28
29     ============ ==============================================================
30     Architecture Description
31     ============ ==============================================================
32     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34     ============ ==============================================================
35
36  .. table:: AMDGPU Vendors
37     :name: amdgpu-vendor-table
38
39     ============ ==============================================================
40     Vendor       Description
41     ============ ==============================================================
42     ``amd``      Can be used for all AMD GPU usage.
43     ``mesa3d``   Can be used if the OS is ``mesa3d``.
44     ============ ==============================================================
45
46  .. table:: AMDGPU Operating Systems
47     :name: amdgpu-os-table
48
49     ============== ============================================================
50     OS             Description
51     ============== ============================================================
52     *<empty>*      Defaults to the *unknown* OS.
53     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
54                    such as AMD's ROCm [AMD-ROCm]_.
55     ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
56                    runtime.
57     ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
58                    runtime.
59     ============== ============================================================
60
61  .. table:: AMDGPU Environments
62     :name: amdgpu-environment-table
63
64     ============ ==============================================================
65     Environment  Description
66     ============ ==============================================================
67     *<empty>*    Defaults to ``opencl``.
68     ``opencl``   OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69     ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70                  used (see :ref:`amdgpu-address-spaces`).
71     ``amdgiz``   Same as ``opencl`` except a different address space mapping is
72                  used (see :ref:`amdgpu-address-spaces`).
73     ``hcc``      AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74     ============ ==============================================================
75
76.. _amdgpu-processors:
77
78Processors
79----------
80
81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82names from both the *Processor* and *Alternative Processor* can be used.
83
84  .. table:: AMDGPU Processors
85     :name: amdgpu-processor-table
86
87     =========== =============== ============ ===== ========= ======= ==================
88     Processor   Alternative     Target       dGPU/ Target    ROCm    Example
89                 Processor       Triple       APU   Features  Support Products
90                                 Architecture       Supported
91                                                    [Default]
92     =========== =============== ============ ===== ========= ======= ==================
93     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
94     -----------------------------------------------------------------------------------
95     ``r600``                    ``r600``     dGPU
96     ``r630``                    ``r600``     dGPU
97     ``rs880``                   ``r600``     dGPU
98     ``rv670``                   ``r600``     dGPU
99     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
100     -----------------------------------------------------------------------------------
101     ``rv710``                   ``r600``     dGPU
102     ``rv730``                   ``r600``     dGPU
103     ``rv770``                   ``r600``     dGPU
104     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
105     -----------------------------------------------------------------------------------
106     ``cedar``                   ``r600``     dGPU
107     ``redwood``                 ``r600``     dGPU
108     ``sumo``                    ``r600``     dGPU
109     ``juniper``                 ``r600``     dGPU
110     ``cypress``                 ``r600``     dGPU
111     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
112     -----------------------------------------------------------------------------------
113     ``barts``                   ``r600``     dGPU
114     ``turks``                   ``r600``     dGPU
115     ``caicos``                  ``r600``     dGPU
116     ``cayman``                  ``r600``     dGPU
117     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
118     -----------------------------------------------------------------------------------
119     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
120     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU
121                 - ``verde``
122                 - ``oland``
123                 - ``hainan``
124     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
125     -----------------------------------------------------------------------------------
126     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
127                                                                      - A6 Pro-7050B
128                                                                      - A8-7100
129                                                                      - A8 Pro-7150B
130                                                                      - A10-7300
131                                                                      - A10 Pro-7350B
132                                                                      - FX-7500
133                                                                      - A8-7200P
134                                                                      - A10-7400P
135                                                                      - FX-7600P
136     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
137                                                                      - FirePro W9100
138                                                                      - FirePro S9150
139                                                                      - FirePro S9170
140     ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
141                                                                      - Radeon R9 290x
142                                                                      - Radeon R390
143                                                                      - Radeon R390x
144     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
145                 - ``mullins``                                        - E1-2200
146                                                                      - E1-2500
147                                                                      - E2-3000
148                                                                      - E2-3800
149                                                                      - A4-5000
150                                                                      - A4-5100
151                                                                      - A6-5200
152                                                                      - A4 Pro-3340B
153     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
154                                                                      - Radeon HD 8770
155                                                                      - R7 260
156                                                                      - R7 260X
157     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
158     -----------------------------------------------------------------------------------
159     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
160                                                      [on]            - Pro A6-8500B
161                                                                      - A8-8600P
162                                                                      - Pro A8-8600B
163                                                                      - FX-8800P
164                                                                      - Pro A12-8800B
165     \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
166                                                      [on]            - Pro A10-8700B
167                                                                      - A10-8780P
168     \                           ``amdgcn``   APU   - xnack           - A10-9600P
169                                                      [on]            - A10-9630P
170                                                                      - A12-9700P
171                                                                      - A12-9730P
172                                                                      - FX-9800P
173                                                                      - FX-9830P
174     \                           ``amdgcn``   APU   - xnack           - E2-9010
175                                                      [on]            - A6-9210
176                                                                      - A9-9410
177     ``gfx802``  - ``tonga``     ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
178                 - ``iceland``                        [off]           - FirePro S7100
179                                                                      - FirePro W7100
180                                                                      - Radeon R285
181                                                                      - Radeon R9 380
182                                                                      - Radeon R9 385
183                                                                      - Mobile FirePro
184                                                                        M7170
185     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
186                                                      [off]           - Radeon R9 Fury
187                                                                      - Radeon R9 FuryX
188                                                                      - Radeon Pro Duo
189                                                                      - FirePro S9300x2
190                                                                      - Radeon Instinct MI8
191     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
192                                                      [off]           - Radeon RX 480
193                                                                      - Radeon Instinct MI6
194     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
195                                                      [off]
196     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
197                                                      [on]
198     **GCN GFX9** [AMD-GCN-GFX9]_
199     -----------------------------------------------------------------------------------
200     ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
201                                                      [off]             Frontier Edition
202                                                                      - Radeon RX Vega 56
203                                                                      - Radeon RX Vega 64
204                                                                      - Radeon RX Vega 64
205                                                                        Liquid
206                                                                      - Radeon Instinct MI25
207     ``gfx902``                  ``amdgcn``   APU   - xnack           *TBA*
208                                                      [on]
209                                                                      .. TODO
210                                                                         Add product
211                                                                         names.
212     =========== =============== ============ ===== ========= ======= ==================
213
214.. _amdgpu-target-features:
215
216Target Features
217---------------
218
219Target features control how code is generated to support certain
220processor specific features. Not all target features are supported by
221all processors. The runtime must ensure that the features supported by
222the device used to execute the code match the features enabled when
223generating the code. A mismatch of features may result in incorrect
224execution, or a reduction in performance.
225
226The target features supported by each processor, and the default value
227used if not specified explicitly, is listed in
228:ref:`amdgpu-processor-table`.
229
230Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
231target features.
232
233For example:
234
235``-mxnack``
236  Enable the ``xnack`` feature.
237``-mno-xnack``
238  Disable the ``xnack`` feature.
239
240  .. table:: AMDGPU Target Features
241     :name: amdgpu-target-feature-table
242
243     ============== ==================================================
244     Target Feature Description
245     ============== ==================================================
246     -m[no-]xnack   Enable/disable generating code that has
247                    memory clauses that are compatible with
248                    having XNACK replay enabled.
249
250                    This is used for demand paging and page
251                    migration. If XNACK replay is enabled in
252                    the device, then if a page fault occurs
253                    the code may execute incorrectly if the
254                    ``xnack`` feature is not enabled. Executing
255                    code that has the feature enabled on a
256                    device that does not have XNACK replay
257                    enabled will execute correctly, but may
258                    be less performant than code with the
259                    feature disabled.
260     ============== ==================================================
261
262.. _amdgpu-address-spaces:
263
264Address Spaces
265--------------
266
267The AMDGPU backend uses the following address space mappings.
268
269The memory space names used in the table, aside from the region memory space, is
270from the OpenCL standard.
271
272LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
273
274  .. table:: Address Space Mapping
275     :name: amdgpu-address-space-mapping-table
276
277     ================== ================= ================= ================= =================
278     LLVM Address Space Memory Space
279     ------------------ -----------------------------------------------------------------------
280     \                  Current Default   amdgiz/amdgizcl   hcc               Future Default
281     ================== ================= ================= ================= =================
282     0                  Private (Scratch) Generic (Flat)    Generic (Flat)    Generic (Flat)
283     1                  Global            Global            Global            Global
284     2                  Constant          Constant          Constant          Region (GDS)
285     3                  Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
286     4                  Generic (Flat)    Region (GDS)      Region (GDS)      Constant
287     5                  Region (GDS)      Private (Scratch) Private (Scratch) Private (Scratch)
288     ================== ================= ================= ================= =================
289
290Current Default
291  This is the current default address space mapping used for all languages
292  except hcc. This will shortly be deprecated.
293
294amdgiz/amdgizcl
295  This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
296  is specified as the target triple environment value.
297
298hcc
299  This is the current address space mapping used when ``hcc`` is specified as
300  the target triple environment value.This will shortly be deprecated.
301
302Future Default
303  This will shortly be the only address space mapping for all languages using
304  AMDGPU backend.
305
306.. _amdgpu-memory-scopes:
307
308Memory Scopes
309-------------
310
311This section provides LLVM memory synchronization scopes supported by the AMDGPU
312backend memory model when the target triple OS is ``amdhsa`` (see
313:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
314
315The memory model supported is based on the HSA memory model [HSA]_ which is
316based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
317relation is transitive over the synchonizes-with relation independent of scope,
318and synchonizes-with allows the memory scope instances to be inclusive (see
319table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
320
321This is different to the OpenCL [OpenCL]_ memory model which does not have scope
322inclusion and requires the memory scopes to exactly match. However, this
323is conservatively correct for OpenCL.
324
325  .. table:: AMDHSA LLVM Sync Scopes
326     :name: amdgpu-amdhsa-llvm-sync-scopes-table
327
328     ================ ==========================================================
329     LLVM Sync Scope  Description
330     ================ ==========================================================
331     *none*           The default: ``system``.
332
333                      Synchronizes with, and participates in modification and
334                      seq_cst total orderings with, other operations (except
335                      image operations) for all address spaces (except private,
336                      or generic that accesses private) provided the other
337                      operation's sync scope is:
338
339                      - ``system``.
340                      - ``agent`` and executed by a thread on the same agent.
341                      - ``workgroup`` and executed by a thread in the same
342                        workgroup.
343                      - ``wavefront`` and executed by a thread in the same
344                        wavefront.
345
346     ``agent``        Synchronizes with, and participates in modification and
347                      seq_cst total orderings with, other operations (except
348                      image operations) for all address spaces (except private,
349                      or generic that accesses private) provided the other
350                      operation's sync scope is:
351
352                      - ``system`` or ``agent`` and executed by a thread on the
353                        same agent.
354                      - ``workgroup`` and executed by a thread in the same
355                        workgroup.
356                      - ``wavefront`` and executed by a thread in the same
357                        wavefront.
358
359     ``workgroup``    Synchronizes with, and participates in modification and
360                      seq_cst total orderings with, other operations (except
361                      image operations) for all address spaces (except private,
362                      or generic that accesses private) provided the other
363                      operation's sync scope is:
364
365                      - ``system``, ``agent`` or ``workgroup`` and executed by a
366                        thread in the same workgroup.
367                      - ``wavefront`` and executed by a thread in the same
368                        wavefront.
369
370     ``wavefront``    Synchronizes with, and participates in modification and
371                      seq_cst total orderings with, other operations (except
372                      image operations) for all address spaces (except private,
373                      or generic that accesses private) provided the other
374                      operation's sync scope is:
375
376                      - ``system``, ``agent``, ``workgroup`` or ``wavefront``
377                        and executed by a thread in the same wavefront.
378
379     ``singlethread`` Only synchronizes with, and participates in modification
380                      and seq_cst total orderings with, other operations (except
381                      image operations) running in the same thread for all
382                      address spaces (for example, in signal handlers).
383     ================ ==========================================================
384
385AMDGPU Intrinsics
386-----------------
387
388The AMDGPU backend implements the following intrinsics.
389
390*This section is WIP.*
391
392.. TODO
393   List AMDGPU intrinsics
394
395Code Object
396===========
397
398The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
399can be linked by ``lld`` to produce a standard ELF shared code object which can
400be loaded and executed on an AMDGPU target.
401
402Header
403------
404
405The AMDGPU backend uses the following ELF header:
406
407  .. table:: AMDGPU ELF Header
408     :name: amdgpu-elf-header-table
409
410     ========================== ===============================
411     Field                      Value
412     ========================== ===============================
413     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
414     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
415     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
416                                - ``ELFOSABI_AMDGPU_HSA``
417                                - ``ELFOSABI_AMDGPU_PAL``
418                                - ``ELFOSABI_AMDGPU_MESA3D``
419     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
420                                - ``ELFABIVERSION_AMDGPU_PAL``
421                                - ``ELFABIVERSION_AMDGPU_MESA3D``
422     ``e_type``                 - ``ET_REL``
423                                - ``ET_DYN``
424     ``e_machine``              ``EM_AMDGPU``
425     ``e_entry``                0
426     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
427     ========================== ===============================
428
429..
430
431  .. table:: AMDGPU ELF Header Enumeration Values
432     :name: amdgpu-elf-header-enumeration-values-table
433
434     =============================== =====
435     Name                            Value
436     =============================== =====
437     ``EM_AMDGPU``                   224
438     ``ELFOSABI_NONE``               0
439     ``ELFOSABI_AMDGPU_HSA``         64
440     ``ELFOSABI_AMDGPU_PAL``         65
441     ``ELFOSABI_AMDGPU_MESA3D``      66
442     ``ELFABIVERSION_AMDGPU_HSA``    1
443     ``ELFABIVERSION_AMDGPU_PAL``    0
444     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
445     =============================== =====
446
447``e_ident[EI_CLASS]``
448  The ELF class is:
449
450  * ``ELFCLASS32`` for ``r600`` architecture.
451
452  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
453    bit applications.
454
455``e_ident[EI_DATA]``
456  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
457
458``e_ident[EI_OSABI]``
459  One of the following AMD GPU architecture specific OS ABIs
460  (see :ref:`amdgpu-os-table`):
461
462  * ``ELFOSABI_NONE`` for *unknown* OS.
463
464  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
465
466  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
467
468  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
469
470``e_ident[EI_ABIVERSION]``
471  The ABI version of the AMD GPU architecture specific OS ABI to which the code
472  object conforms:
473
474  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
475    runtime ABI.
476
477  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
478    runtime ABI.
479
480  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
481    3D runtime ABI.
482
483``e_type``
484  Can be one of the following values:
485
486
487  ``ET_REL``
488    The type produced by the AMD GPU backend compiler as it is relocatable code
489    object.
490
491  ``ET_DYN``
492    The type produced by the linker as it is a shared code object.
493
494  The AMD HSA runtime loader requires a ``ET_DYN`` code object.
495
496``e_machine``
497  The value ``EM_AMDGPU`` is used for the machine for all processors supported
498  by the ``r600`` and ``amdgcn`` architectures (see
499  :ref:`amdgpu-processor-table`). The specific processor is specified in the
500  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
501  :ref:`amdgpu-elf-header-e_flags-table`).
502
503``e_entry``
504  The entry point is 0 as the entry points for individual kernels must be
505  selected in order to invoke them through AQL packets.
506
507``e_flags``
508  The AMDGPU backend uses the following ELF header flags:
509
510  .. table:: AMDGPU ELF Header ``e_flags``
511     :name: amdgpu-elf-header-e_flags-table
512
513     ================================= ========== =============================
514     Name                              Value      Description
515     ================================= ========== =============================
516     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
517     -------------------------------------------- -----------------------------
518     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
519                                                  mask for
520                                                  ``EF_AMDGPU_MACH_xxx`` values
521                                                  defined in
522                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
523     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
524                                                  target feature is
525                                                  enabled for all code
526                                                  contained in the code object.
527                                                  See
528                                                  :ref:`amdgpu-target-features`.
529     ================================= ========== =============================
530
531  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
532     :name: amdgpu-ef-amdgpu-mach-table
533
534     ================================= ========== =============================
535     Name                              Value      Description (see
536                                                  :ref:`amdgpu-processor-table`)
537     ================================= ========== =============================
538     ``EF_AMDGPU_MACH_NONE``           0          *not specified*
539     ``EF_AMDGPU_MACH_R600_R600``      1          ``r600``
540     ``EF_AMDGPU_MACH_R600_R630``      2          ``r630``
541     ``EF_AMDGPU_MACH_R600_RS880``     3          ``rs880``
542     ``EF_AMDGPU_MACH_R600_RV670``     4          ``rv670``
543     ``EF_AMDGPU_MACH_R600_RV710``     5          ``rv710``
544     ``EF_AMDGPU_MACH_R600_RV730``     6          ``rv730``
545     ``EF_AMDGPU_MACH_R600_RV770``     7          ``rv770``
546     ``EF_AMDGPU_MACH_R600_CEDAR``     8          ``cedar``
547     ``EF_AMDGPU_MACH_R600_REDWOOD``   9          ``redwood``
548     ``EF_AMDGPU_MACH_R600_SUMO``      10         ``sumo``
549     ``EF_AMDGPU_MACH_R600_JUNIPER``   11         ``juniper``
550     ``EF_AMDGPU_MACH_R600_CYPRESS``   12         ``cypress``
551     ``EF_AMDGPU_MACH_R600_BARTS``     13         ``barts``
552     ``EF_AMDGPU_MACH_R600_TURKS``     14         ``turks``
553     ``EF_AMDGPU_MACH_R600_CAICOS``    15         ``caicos``
554     ``EF_AMDGPU_MACH_R600_CAYMAN``    16         ``cayman``
555     *reserved*                        17-31      Reserved for ``r600``
556                                                  architecture processors.
557     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  32         ``gfx600``
558     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  33         ``gfx601``
559     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  34         ``gfx700``
560     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  35         ``gfx701``
561     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  36         ``gfx702``
562     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  37         ``gfx703``
563     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  38         ``gfx704``
564     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  39         ``gfx801``
565     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  40         ``gfx802``
566     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  41         ``gfx803``
567     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  42         ``gfx810``
568     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  43         ``gfx900``
569     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  44         ``gfx902``
570     ================================= ========== =============================
571
572Sections
573--------
574
575An AMDGPU target ELF code object has the standard ELF sections which include:
576
577  .. table:: AMDGPU ELF Sections
578     :name: amdgpu-elf-sections-table
579
580     ================== ================ =================================
581     Name               Type             Attributes
582     ================== ================ =================================
583     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
584     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
585     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
586     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
587     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
588     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
589     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
590     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
591     ``.note``          ``SHT_NOTE``     *none*
592     ``.rela``\ *name*  ``SHT_RELA``     *none*
593     ``.rela.dyn``      ``SHT_RELA``     *none*
594     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
595     ``.shstrtab``      ``SHT_STRTAB``   *none*
596     ``.strtab``        ``SHT_STRTAB``   *none*
597     ``.symtab``        ``SHT_SYMTAB``   *none*
598     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
599     ================== ================ =================================
600
601These sections have their standard meanings (see [ELF]_) and are only generated
602if needed.
603
604``.debug``\ *\**
605  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
606  DWARF produced by the AMDGPU backend.
607
608``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
609  The standard sections used by a dynamic loader.
610
611``.note``
612  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
613  backend.
614
615``.rela``\ *name*, ``.rela.dyn``
616  For relocatable code objects, *name* is the name of the section that the
617  relocation records apply. For example, ``.rela.text`` is the section name for
618  relocation records associated with the ``.text`` section.
619
620  For linked shared code objects, ``.rela.dyn`` contains all the relocation
621  records from each of the relocatable code object's ``.rela``\ *name* sections.
622
623  See :ref:`amdgpu-relocation-records` for the relocation records supported by
624  the AMDGPU backend.
625
626``.text``
627  The executable machine code for the kernels and functions they call. Generated
628  as position independent code. See :ref:`amdgpu-code-conventions` for
629  information on conventions used in the isa generation.
630
631.. _amdgpu-note-records:
632
633Note Records
634------------
635
636As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
637be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
638aligned. In addition, minimal zero byte padding must be generated to ensure the
639``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
640``.note`` section must be at least 4 to indicate at least 8 byte alignment.
641
642The AMDGPU backend code object uses the following ELF note records in the
643``.note`` section. The *Description* column specifies the layout of the note
644record's ``desc`` field. All fields are consecutive bytes. Note records with
645variable size strings have a corresponding ``*_size`` field that specifies the
646number of bytes, including the terminating null character, in the string. The
647string(s) come immediately after the preceding fields.
648
649Additional note records can be present.
650
651  .. table:: AMDGPU ELF Note Records
652     :name: amdgpu-elf-note-records-table
653
654     ===== ============================== ======================================
655     Name  Type                           Description
656     ===== ============================== ======================================
657     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
658     ===== ============================== ======================================
659
660..
661
662  .. table:: AMDGPU ELF Note Record Enumeration Values
663     :name: amdgpu-elf-note-record-enumeration-values-table
664
665     ============================== =====
666     Name                           Value
667     ============================== =====
668     *reserved*                       0-9
669     ``NT_AMD_AMDGPU_HSA_METADATA``    10
670     *reserved*                        11
671     ============================== =====
672
673``NT_AMD_AMDGPU_HSA_METADATA``
674  Specifies extensible metadata associated with the code objects executed on HSA
675  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
676  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
677  :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
678  object metadata string.
679
680.. _amdgpu-symbols:
681
682Symbols
683-------
684
685Symbols include the following:
686
687  .. table:: AMDGPU ELF Symbols
688     :name: amdgpu-elf-symbols-table
689
690     ===================== ============== ============= ==================
691     Name                  Type           Section       Description
692     ===================== ============== ============= ==================
693     *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
694                                          - ``.rodata``
695                                          - ``.bss``
696     *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
697     *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
698     ===================== ============== ============= ==================
699
700Global variable
701  Global variables both used and defined by the compilation unit.
702
703  If the symbol is defined in the compilation unit then it is allocated in the
704  appropriate section according to if it has initialized data or is readonly.
705
706  If the symbol is external then its section is ``STN_UNDEF`` and the loader
707  will resolve relocations using the definition provided by another code object
708  or explicitly defined by the runtime.
709
710  All global symbols, whether defined in the compilation unit or external, are
711  accessed by the machine code indirectly through a GOT table entry. This
712  allows them to be preemptable. The GOT table is only supported when the target
713  triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
714
715  .. TODO
716     Add description of linked shared object symbols. Seems undefined symbols
717     are marked as STT_NOTYPE.
718
719Kernel descriptor
720  Every HSA kernel has an associated kernel descriptor. It is the address of the
721  kernel descriptor that is used in the AQL dispatch packet used to invoke the
722  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
723  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
724
725Kernel entry point
726  Every HSA kernel also has a symbol for its machine code entry point.
727
728.. _amdgpu-relocation-records:
729
730Relocation Records
731------------------
732
733AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
734relocatable fields are:
735
736``word32``
737  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
738  alignment. These values use the same byte order as other word values in the
739  AMD GPU architecture.
740
741``word64``
742  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
743  alignment. These values use the same byte order as other word values in the
744  AMD GPU architecture.
745
746Following notations are used for specifying relocation calculations:
747
748**A**
749  Represents the addend used to compute the value of the relocatable field.
750
751**G**
752  Represents the offset into the global offset table at which the relocation
753  entry's symbol will reside during execution.
754
755**GOT**
756  Represents the address of the global offset table.
757
758**P**
759  Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
760  of the storage unit being relocated (computed using ``r_offset``).
761
762**S**
763  Represents the value of the symbol whose index resides in the relocation
764  entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
765
766**B**
767  Represents the base address of a loaded executable or shared object which is
768  the difference between the ELF address and the actual load address. Relocations
769  using this are only valid in executable or shared objects.
770
771The following relocation types are supported:
772
773  .. table:: AMDGPU ELF Relocation Records
774     :name: amdgpu-elf-relocation-records-table
775
776     ==========================  =====  ==========  ==============================
777     Relocation Type             Value  Field       Calculation
778     ==========================  =====  ==========  ==============================
779     ``R_AMDGPU_NONE``           0      *none*      *none*
780     ``R_AMDGPU_ABS32_LO``       1      ``word32``  (S + A) & 0xFFFFFFFF
781     ``R_AMDGPU_ABS32_HI``       2      ``word32``  (S + A) >> 32
782     ``R_AMDGPU_ABS64``          3      ``word64``  S + A
783     ``R_AMDGPU_REL32``          4      ``word32``  S + A - P
784     ``R_AMDGPU_REL64``          5      ``word64``  S + A - P
785     ``R_AMDGPU_ABS32``          6      ``word32``  S + A
786     ``R_AMDGPU_GOTPCREL``       7      ``word32``  G + GOT + A - P
787     ``R_AMDGPU_GOTPCREL32_LO``  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
788     ``R_AMDGPU_GOTPCREL32_HI``  9      ``word32``  (G + GOT + A - P) >> 32
789     ``R_AMDGPU_REL32_LO``       10     ``word32``  (S + A - P) & 0xFFFFFFFF
790     ``R_AMDGPU_REL32_HI``       11     ``word32``  (S + A - P) >> 32
791     *reserved*                  12
792     ``R_AMDGPU_RELATIVE64``     13     ``word64``  B + A
793     ==========================  =====  ==========  ==============================
794
795.. _amdgpu-dwarf:
796
797DWARF
798-----
799
800Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
801information that maps the code object executable code and data to the source
802language constructs. It can be used by tools such as debuggers and profilers.
803
804Address Space Mapping
805~~~~~~~~~~~~~~~~~~~~~
806
807The following address space mapping is used:
808
809  .. table:: AMDGPU DWARF Address Space Mapping
810     :name: amdgpu-dwarf-address-space-mapping-table
811
812     =================== =================
813     DWARF Address Space Memory Space
814     =================== =================
815     1                   Private (Scratch)
816     2                   Local (group/LDS)
817     *omitted*           Global
818     *omitted*           Constant
819     *omitted*           Generic (Flat)
820     *not supported*     Region (GDS)
821     =================== =================
822
823See :ref:`amdgpu-address-spaces` for information on the memory space terminology
824used in the table.
825
826An ``address_class`` attribute is generated on pointer type DIEs to specify the
827DWARF address space of the value of the pointer when it is in the *private* or
828*local* address space. Otherwise the attribute is omitted.
829
830An ``XDEREF`` operation is generated in location list expressions for variables
831that are allocated in the *private* and *local* address space. Otherwise no
832``XDREF`` is omitted.
833
834Register Mapping
835~~~~~~~~~~~~~~~~
836
837*This section is WIP.*
838
839.. TODO
840   Define DWARF register enumeration.
841
842   If want to present a wavefront state then should expose vector registers as
843   64 wide (rather than per work-item view that LLVM uses). Either as separate
844   registers, or a 64x4 byte single register. In either case use a new LANE op
845   (akin to XDREF) to select the current lane usage in a location
846   expression. This would also allow scalar register spilling to vector register
847   lanes to be expressed (currently no debug information is being generated for
848   spilling). If choose a wide single register approach then use LANE in
849   conjunction with PIECE operation to select the dword part of the register for
850   the current lane. If the separate register approach then use LANE to select
851   the register.
852
853Source Text
854~~~~~~~~~~~
855
856*This section is WIP.*
857
858.. TODO
859   DWARF extension to include runtime generated source text.
860
861.. _amdgpu-code-conventions:
862
863Code Conventions
864================
865
866This section provides code conventions used for each supported target triple OS
867(see :ref:`amdgpu-target-triples`).
868
869AMDHSA
870------
871
872This section provides code conventions used when the target triple OS is
873``amdhsa`` (see :ref:`amdgpu-target-triples`).
874
875.. _amdgpu-amdhsa-hsa-code-object-metadata:
876
877Code Object Metadata
878~~~~~~~~~~~~~~~~~~~~
879
880The code object metadata specifies extensible metadata associated with the code
881objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
882[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
883(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
884``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
885information necessary to support the ROCM kernel queries. For example, the
886segment sizes needed in a dispatch packet. In addition, a high level language
887runtime may require other information to be included. For example, the AMD
888OpenCL runtime records kernel argument information.
889
890The metadata is specified as a YAML formatted string (see [YAML]_ and
891:doc:`YamlIO`).
892
893.. TODO
894   Is the string null terminated? It probably should not if YAML allows it to
895   contain null characters, otherwise it should be.
896
897The metadata is represented as a single YAML document comprised of the mapping
898defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
899referenced tables.
900
901For boolean values, the string values of ``false`` and ``true`` are used for
902false and true respectively.
903
904Additional information can be added to the mappings. To avoid conflicts, any
905non-AMD key names should be prefixed by "*vendor-name*.".
906
907  .. table:: AMDHSA Code Object Metadata Mapping
908     :name: amdgpu-amdhsa-code-object-metadata-mapping-table
909
910     ========== ============== ========= =======================================
911     String Key Value Type     Required? Description
912     ========== ============== ========= =======================================
913     "Version"  sequence of    Required  - The first integer is the major
914                2 integers                 version. Currently 1.
915                                         - The second integer is the minor
916                                           version. Currently 0.
917     "Printf"   sequence of              Each string is encoded information
918                strings                  about a printf function call. The
919                                         encoded information is organized as
920                                         fields separated by colon (':'):
921
922                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
923
924                                         where:
925
926                                         ``ID``
927                                           A 32 bit integer as a unique id for
928                                           each printf function call
929
930                                         ``N``
931                                           A 32 bit integer equal to the number
932                                           of arguments of printf function call
933                                           minus 1
934
935                                         ``S[i]`` (where i = 0, 1, ... , N-1)
936                                           32 bit integers for the size in bytes
937                                           of the i-th FormatString argument of
938                                           the printf function call
939
940                                         FormatString
941                                           The format string passed to the
942                                           printf function call.
943     "Kernels"  sequence of    Required  Sequence of the mappings for each
944                mapping                  kernel in the code object. See
945                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
946                                         for the definition of the mapping.
947     ========== ============== ========= =======================================
948
949..
950
951  .. table:: AMDHSA Code Object Kernel Metadata Mapping
952     :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
953
954     ================= ============== ========= ================================
955     String Key        Value Type     Required? Description
956     ================= ============== ========= ================================
957     "Name"            string         Required  Source name of the kernel.
958     "SymbolName"      string         Required  Name of the kernel
959                                                descriptor ELF symbol.
960     "Language"        string                   Source language of the kernel.
961                                                Values include:
962
963                                                - "OpenCL C"
964                                                - "OpenCL C++"
965                                                - "HCC"
966                                                - "OpenMP"
967
968     "LanguageVersion" sequence of              - The first integer is the major
969                       2 integers                 version.
970                                                - The second integer is the
971                                                  minor version.
972     "Attrs"           mapping                  Mapping of kernel attributes.
973                                                See
974                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
975                                                for the mapping definition.
976     "Args"            sequence of              Sequence of mappings of the
977                       mapping                  kernel arguments. See
978                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
979                                                for the definition of the mapping.
980     "CodeProps"       mapping                  Mapping of properties related to
981                                                the kernel code. See
982                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
983                                                for the mapping definition.
984     ================= ============== ========= ================================
985
986..
987
988  .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
989     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
990
991     =================== ============== ========= ==============================
992     String Key          Value Type     Required? Description
993     =================== ============== ========= ==============================
994     "ReqdWorkGroupSize" sequence of              The dispatch work-group size
995                         3 integers               X, Y, Z must correspond to the
996                                                  specified values.
997
998                                                  Corresponds to the OpenCL
999                                                  ``reqd_work_group_size``
1000                                                  attribute.
1001     "WorkGroupSizeHint" sequence of              The dispatch work-group size
1002                         3 integers               X, Y, Z is likely to be the
1003                                                  specified values.
1004
1005                                                  Corresponds to the OpenCL
1006                                                  ``work_group_size_hint``
1007                                                  attribute.
1008     "VecTypeHint"       string                   The name of a scalar or vector
1009                                                  type.
1010
1011                                                  Corresponds to the OpenCL
1012                                                  ``vec_type_hint`` attribute.
1013
1014     "RuntimeHandle"     string                   The external symbol name
1015                                                  associated with a kernel.
1016                                                  OpenCL runtime allocates a
1017                                                  global buffer for the symbol
1018                                                  and saves the kernel's address
1019                                                  to it, which is used for
1020                                                  device side enqueueing. Only
1021                                                  available for device side
1022                                                  enqueued kernels.
1023     =================== ============== ========= ==============================
1024
1025..
1026
1027  .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1028     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1029
1030     ================= ============== ========= ================================
1031     String Key        Value Type     Required? Description
1032     ================= ============== ========= ================================
1033     "Name"            string                   Kernel argument name.
1034     "TypeName"        string                   Kernel argument type name.
1035     "Size"            integer        Required  Kernel argument size in bytes.
1036     "Align"           integer        Required  Kernel argument alignment in
1037                                                bytes. Must be a power of two.
1038     "ValueKind"       string         Required  Kernel argument kind that
1039                                                specifies how to set up the
1040                                                corresponding argument.
1041                                                Values include:
1042
1043                                                "ByValue"
1044                                                  The argument is copied
1045                                                  directly into the kernarg.
1046
1047                                                "GlobalBuffer"
1048                                                  A global address space pointer
1049                                                  to the buffer data is passed
1050                                                  in the kernarg.
1051
1052                                                "DynamicSharedPointer"
1053                                                  A group address space pointer
1054                                                  to dynamically allocated LDS
1055                                                  is passed in the kernarg.
1056
1057                                                "Sampler"
1058                                                  A global address space
1059                                                  pointer to a S# is passed in
1060                                                  the kernarg.
1061
1062                                                "Image"
1063                                                  A global address space
1064                                                  pointer to a T# is passed in
1065                                                  the kernarg.
1066
1067                                                "Pipe"
1068                                                  A global address space pointer
1069                                                  to an OpenCL pipe is passed in
1070                                                  the kernarg.
1071
1072                                                "Queue"
1073                                                  A global address space pointer
1074                                                  to an OpenCL device enqueue
1075                                                  queue is passed in the
1076                                                  kernarg.
1077
1078                                                "HiddenGlobalOffsetX"
1079                                                  The OpenCL grid dispatch
1080                                                  global offset for the X
1081                                                  dimension is passed in the
1082                                                  kernarg.
1083
1084                                                "HiddenGlobalOffsetY"
1085                                                  The OpenCL grid dispatch
1086                                                  global offset for the Y
1087                                                  dimension is passed in the
1088                                                  kernarg.
1089
1090                                                "HiddenGlobalOffsetZ"
1091                                                  The OpenCL grid dispatch
1092                                                  global offset for the Z
1093                                                  dimension is passed in the
1094                                                  kernarg.
1095
1096                                                "HiddenNone"
1097                                                  An argument that is not used
1098                                                  by the kernel. Space needs to
1099                                                  be left for it, but it does
1100                                                  not need to be set up.
1101
1102                                                "HiddenPrintfBuffer"
1103                                                  A global address space pointer
1104                                                  to the runtime printf buffer
1105                                                  is passed in kernarg.
1106
1107                                                "HiddenDefaultQueue"
1108                                                  A global address space pointer
1109                                                  to the OpenCL device enqueue
1110                                                  queue that should be used by
1111                                                  the kernel by default is
1112                                                  passed in the kernarg.
1113
1114                                                "HiddenCompletionAction"
1115                                                  A global address space pointer
1116                                                  to help link enqueued kernels into
1117                                                  the ancestor tree for determining
1118                                                  when the parent kernel has finished.
1119
1120     "ValueType"       string         Required  Kernel argument value type. Only
1121                                                present if "ValueKind" is
1122                                                "ByValue". For vector data
1123                                                types, the value is for the
1124                                                element type. Values include:
1125
1126                                                - "Struct"
1127                                                - "I8"
1128                                                - "U8"
1129                                                - "I16"
1130                                                - "U16"
1131                                                - "F16"
1132                                                - "I32"
1133                                                - "U32"
1134                                                - "F32"
1135                                                - "I64"
1136                                                - "U64"
1137                                                - "F64"
1138
1139                                                .. TODO
1140                                                   How can it be determined if a
1141                                                   vector type, and what size
1142                                                   vector?
1143     "PointeeAlign"    integer                  Alignment in bytes of pointee
1144                                                type for pointer type kernel
1145                                                argument. Must be a power
1146                                                of 2. Only present if
1147                                                "ValueKind" is
1148                                                "DynamicSharedPointer".
1149     "AddrSpaceQual"   string                   Kernel argument address space
1150                                                qualifier. Only present if
1151                                                "ValueKind" is "GlobalBuffer" or
1152                                                "DynamicSharedPointer". Values
1153                                                are:
1154
1155                                                - "Private"
1156                                                - "Global"
1157                                                - "Constant"
1158                                                - "Local"
1159                                                - "Generic"
1160                                                - "Region"
1161
1162                                                .. TODO
1163                                                   Is GlobalBuffer only Global
1164                                                   or Constant? Is
1165                                                   DynamicSharedPointer always
1166                                                   Local? Can HCC allow Generic?
1167                                                   How can Private or Region
1168                                                   ever happen?
1169     "AccQual"         string                   Kernel argument access
1170                                                qualifier. Only present if
1171                                                "ValueKind" is "Image" or
1172                                                "Pipe". Values
1173                                                are:
1174
1175                                                - "ReadOnly"
1176                                                - "WriteOnly"
1177                                                - "ReadWrite"
1178
1179                                                .. TODO
1180                                                   Does this apply to
1181                                                   GlobalBuffer?
1182     "ActualAccQual"   string                   The actual memory accesses
1183                                                performed by the kernel on the
1184                                                kernel argument. Only present if
1185                                                "ValueKind" is "GlobalBuffer",
1186                                                "Image", or "Pipe". This may be
1187                                                more restrictive than indicated
1188                                                by "AccQual" to reflect what the
1189                                                kernel actual does. If not
1190                                                present then the runtime must
1191                                                assume what is implied by
1192                                                "AccQual" and "IsConst". Values
1193                                                are:
1194
1195                                                - "ReadOnly"
1196                                                - "WriteOnly"
1197                                                - "ReadWrite"
1198
1199     "IsConst"         boolean                  Indicates if the kernel argument
1200                                                is const qualified. Only present
1201                                                if "ValueKind" is
1202                                                "GlobalBuffer".
1203
1204     "IsRestrict"      boolean                  Indicates if the kernel argument
1205                                                is restrict qualified. Only
1206                                                present if "ValueKind" is
1207                                                "GlobalBuffer".
1208
1209     "IsVolatile"      boolean                  Indicates if the kernel argument
1210                                                is volatile qualified. Only
1211                                                present if "ValueKind" is
1212                                                "GlobalBuffer".
1213
1214     "IsPipe"          boolean                  Indicates if the kernel argument
1215                                                is pipe qualified. Only present
1216                                                if "ValueKind" is "Pipe".
1217
1218                                                .. TODO
1219                                                   Can GlobalBuffer be pipe
1220                                                   qualified?
1221     ================= ============== ========= ================================
1222
1223..
1224
1225  .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1226     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1227
1228     ============================ ============== ========= =====================
1229     String Key                   Value Type     Required? Description
1230     ============================ ============== ========= =====================
1231     "KernargSegmentSize"         integer        Required  The size in bytes of
1232                                                           the kernarg segment
1233                                                           that holds the values
1234                                                           of the arguments to
1235                                                           the kernel.
1236     "GroupSegmentFixedSize"      integer        Required  The amount of group
1237                                                           segment memory
1238                                                           required by a
1239                                                           work-group in
1240                                                           bytes. This does not
1241                                                           include any
1242                                                           dynamically allocated
1243                                                           group segment memory
1244                                                           that may be added
1245                                                           when the kernel is
1246                                                           dispatched.
1247     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1248                                                           private address space
1249                                                           memory required for a
1250                                                           work-item in
1251                                                           bytes. If the kernel
1252                                                           uses a dynamic call
1253                                                           stack then additional
1254                                                           space must be added
1255                                                           to this value for the
1256                                                           call stack.
1257     "KernargSegmentAlign"        integer        Required  The maximum byte
1258                                                           alignment of
1259                                                           arguments in the
1260                                                           kernarg segment. Must
1261                                                           be a power of 2.
1262     "WavefrontSize"              integer        Required  Wavefront size. Must
1263                                                           be a power of 2.
1264     "NumSGPRs"                   integer        Required  Number of scalar
1265                                                           registers used by a
1266                                                           wavefront for
1267                                                           GFX6-GFX9. This
1268                                                           includes the special
1269                                                           SGPRs for VCC, Flat
1270                                                           Scratch (GFX7-GFX9)
1271                                                           and XNACK (for
1272                                                           GFX8-GFX9). It does
1273                                                           not include the 16
1274                                                           SGPR added if a trap
1275                                                           handler is
1276                                                           enabled. It is not
1277                                                           rounded up to the
1278                                                           allocation
1279                                                           granularity.
1280     "NumVGPRs"                   integer        Required  Number of vector
1281                                                           registers used by
1282                                                           each work-item for
1283                                                           GFX6-GFX9
1284     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1285                                                           work-group size
1286                                                           supported by the
1287                                                           kernel in work-items.
1288                                                           Must be >=1 and
1289                                                           consistent with any
1290                                                           non-0 values in
1291                                                           FixedWorkGroupSize.
1292     "FixedWorkGroupSize"         sequence of              Corresponds to the
1293                                  3 integers               dispatch work-group
1294                                                           size X, Y, Z. If
1295                                                           omitted, defaults to
1296                                                           0, 0, 0. If an
1297                                                           element is non-0 then
1298                                                           the kernel must only
1299                                                           be launched with a
1300                                                           matching corresponding
1301                                                           work-group size.
1302     "NumSpilledSGPRs"            integer                  Number of stores from
1303                                                           a scalar register to
1304                                                           a register allocator
1305                                                           created spill
1306                                                           location.
1307     "NumSpilledVGPRs"            integer                  Number of stores from
1308                                                           a vector register to
1309                                                           a register allocator
1310                                                           created spill
1311                                                           location.
1312     ============================ ============== ========= =====================
1313
1314..
1315
1316Kernel Dispatch
1317~~~~~~~~~~~~~~~
1318
1319The HSA architected queuing language (AQL) defines a user space memory interface
1320that can be used to control the dispatch of kernels, in an agent independent
1321way. An agent can have zero or more AQL queues created for it using the ROCm
1322runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1323*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1324mechanics and packet layouts.
1325
1326The packet processor of a kernel agent is responsible for detecting and
1327dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1328packet processor is implemented by the hardware command processor (CP),
1329asynchronous dispatch controller (ADC) and shader processor input controller
1330(SPI).
1331
1332The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1333mode driver to initialize and register the AQL queue with CP.
1334
1335To dispatch a kernel the following actions are performed. This can occur in the
1336CPU host program, or from an HSA kernel executing on a GPU.
1337
13381. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1339   executed is obtained.
13402. A pointer to the kernel descriptor (see
1341   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1342   obtained. It must be for a kernel that is contained in a code object that that
1343   was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1344   associated.
13453. Space is allocated for the kernel arguments using the ROCm runtime allocator
1346   for a memory region with the kernarg property for the kernel agent that will
1347   execute the kernel. It must be at least 16 byte aligned.
13484. Kernel argument values are assigned to the kernel argument memory
1349   allocation. The layout is defined in the *HSA Programmer's Language Reference*
1350   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1351   memory in the same way constant memory is accessed. (Note that the HSA
1352   specification allows an implementation to copy the kernel argument contents to
1353   another location that is accessed by the kernel.)
13545. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1355   api uses 64 bit atomic operations to reserve space in the AQL queue for the
1356   packet. The packet must be set up, and the final write must use an atomic
1357   store release to set the packet kind to ensure the packet contents are
1358   visible to the kernel agent. AQL defines a doorbell signal mechanism to
1359   notify the kernel agent that the AQL queue has been updated. These rules, and
1360   the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1361   System Architecture Specification* [HSA]_.
13626. A kernel dispatch packet includes information about the actual dispatch,
1363   such as grid and work-group size, together with information from the code
1364   object about the kernel, such as segment sizes. The ROCm runtime queries on
1365   the kernel symbol can be used to obtain the code object values which are
1366   recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
13677. CP executes micro-code and is responsible for detecting and setting up the
1368   GPU to execute the wavefronts of a kernel dispatch.
13698. CP ensures that when the a wavefront starts executing the kernel machine
1370   code, the scalar general purpose registers (SGPR) and vector general purpose
1371   registers (VGPR) are set up as required by the machine code. The required
1372   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1373   register state is defined in
1374   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13759. The prolog of the kernel machine code (see
1376   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1377   before continuing executing the machine code that corresponds to the kernel.
137810. When the kernel dispatch has completed execution, CP signals the completion
1379    signal specified in the kernel dispatch packet if not 0.
1380
1381.. _amdgpu-amdhsa-memory-spaces:
1382
1383Memory Spaces
1384~~~~~~~~~~~~~
1385
1386The memory space properties are:
1387
1388  .. table:: AMDHSA Memory Spaces
1389     :name: amdgpu-amdhsa-memory-spaces-table
1390
1391     ================= =========== ======== ======= ==================
1392     Memory Space Name HSA Segment Hardware Address NULL Value
1393                       Name        Name     Size
1394     ================= =========== ======== ======= ==================
1395     Private           private     scratch  32      0x00000000
1396     Local             group       LDS      32      0xFFFFFFFF
1397     Global            global      global   64      0x0000000000000000
1398     Constant          constant    *same as 64      0x0000000000000000
1399                                   global*
1400     Generic           flat        flat     64      0x0000000000000000
1401     Region            N/A         GDS      32      *not implemented
1402                                                    for AMDHSA*
1403     ================= =========== ======== ======= ==================
1404
1405The global and constant memory spaces both use global virtual addresses, which
1406are the same virtual address space used by the CPU. However, some virtual
1407addresses may only be accessible to the CPU, some only accessible by the GPU,
1408and some by both.
1409
1410Using the constant memory space indicates that the data will not change during
1411the execution of the kernel. This allows scalar read instructions to be
1412used. The vector and scalar L1 caches are invalidated of volatile data before
1413each kernel dispatch execution to allow constant memory to change values between
1414kernel dispatches.
1415
1416The local memory space uses the hardware Local Data Store (LDS) which is
1417automatically allocated when the hardware creates work-groups of wavefronts, and
1418freed when all the wavefronts of a work-group have terminated. The data store
1419(DS) instructions can be used to access it.
1420
1421The private memory space uses the hardware scratch memory support. If the kernel
1422uses scratch, then the hardware allocates memory that is accessed using
1423wavefront lane dword (4 byte) interleaving. The mapping used from private
1424address to physical address is:
1425
1426  ``wavefront-scratch-base +
1427  (private-address * wavefront-size * 4) +
1428  (wavefront-lane-id * 4)``
1429
1430There are different ways that the wavefront scratch base address is determined
1431by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1432memory can be accessed in an interleaved manner using buffer instruction with
1433the scratch buffer descriptor and per wave scratch offset, by the scratch
1434instructions, or by flat instructions. If each lane of a wavefront accesses the
1435same private address, the interleaving results in adjacent dwords being accessed
1436and hence requires fewer cache lines to be fetched. Multi-dword access is not
1437supported except by flat and scratch instructions in GFX9.
1438
1439The generic address space uses the hardware flat address support available in
1440GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1441local appertures), that are outside the range of addressible global memory, to
1442map from a flat address to a private or local address.
1443
1444FLAT instructions can take a flat address and access global, private (scratch)
1445and group (LDS) memory depending in if the address is within one of the
1446apperture ranges. Flat access to scratch requires hardware aperture setup and
1447setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1448access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1449(see :ref:`amdgpu-amdhsa-m0`).
1450
1451To convert between a segment address and a flat address the base address of the
1452appertures address can be used. For GFX7-GFX8 these are available in the
1453:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1454Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1455GFX9 the appature base addresses are directly available as inline constant
1456registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1457address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1458which makes it easier to convert from flat to segment or segment to flat.
1459
1460Image and Samplers
1461~~~~~~~~~~~~~~~~~~
1462
1463Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1464hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1465HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1466enumeration values for the queries that are not trivially deducible from the S#
1467representation.
1468
1469HSA Signals
1470~~~~~~~~~~~
1471
1472HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1473structure allocated in memory accessible from both the CPU and GPU. The
1474structure is defined by the ROCm runtime and subject to change between releases
1475(see [AMD-ROCm-github]_).
1476
1477.. _amdgpu-amdhsa-hsa-aql-queue:
1478
1479HSA AQL Queue
1480~~~~~~~~~~~~~
1481
1482The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1483between releases (see [AMD-ROCm-github]_). For some processors it contains
1484fields needed to implement certain language features such as the flat address
1485aperture bases. It also contains fields used by CP such as managing the
1486allocation of scratch memory.
1487
1488.. _amdgpu-amdhsa-kernel-descriptor:
1489
1490Kernel Descriptor
1491~~~~~~~~~~~~~~~~~
1492
1493A kernel descriptor consists of the information needed by CP to initiate the
1494execution of a kernel, including the entry point address of the machine code
1495that implements the kernel.
1496
1497Kernel Descriptor for GFX6-GFX9
1498+++++++++++++++++++++++++++++++
1499
1500CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1501
1502  .. table:: Kernel Descriptor for GFX6-GFX9
1503     :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1504
1505     ======= ======= =============================== ============================
1506     Bits    Size    Field Name                      Description
1507     ======= ======= =============================== ============================
1508     31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
1509                                                     address space memory
1510                                                     required for a work-group
1511                                                     in bytes. This does not
1512                                                     include any dynamically
1513                                                     allocated local address
1514                                                     space memory that may be
1515                                                     added when the kernel is
1516                                                     dispatched.
1517     63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
1518                                                     private address space
1519                                                     memory required for a
1520                                                     work-item in bytes. If
1521                                                     is_dynamic_callstack is 1
1522                                                     then additional space must
1523                                                     be added to this value for
1524                                                     the call stack.
1525     127:64  8 bytes                                 Reserved, must be 0.
1526     191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
1527                                                     negative) from base
1528                                                     address of kernel
1529                                                     descriptor to kernel's
1530                                                     entry point instruction
1531                                                     which must be 256 byte
1532                                                     aligned.
1533     223:192 4 bytes MaxFlatWorkGroupSize            Maximum flat work-group
1534                                                     size supported by the
1535                                                     kernel in work-items. If
1536                                                     an exact work-group size
1537                                                     is required then must be
1538                                                     omitted or 0 and
1539                                                     ReqdWorkGroupSize* must
1540                                                     be set to non-0.
1541     239:224 2 bytes ReqdWorkGroupSizeX              If present and non-0 then
1542                                                     the kernel
1543                                                     must be executed with the
1544                                                     specified work-group size
1545                                                     for X.
1546     255:240 2 bytes ReqdWorkGroupSizeY              If present and non-0 then
1547                                                     the kernel
1548                                                     must be executed with the
1549                                                     specified work-group size
1550                                                     for Y.
1551     271:256 2 bytes ReqdWorkGroupSizeZ              If present and non-0 then
1552                                                     the kernel
1553                                                     must be executed with the
1554                                                     specified work-group size
1555                                                     for Z.
1556     383:272 14                                      Reserved, must be 0.
1557             bytes
1558     415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
1559                                                     program settings used by
1560                                                     CP to set up
1561                                                     ``COMPUTE_PGM_RSRC1``
1562                                                     configuration
1563                                                     register. See
1564                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1565     447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
1566                                                     program settings used by
1567                                                     CP to set up
1568                                                     ``COMPUTE_PGM_RSRC2``
1569                                                     configuration
1570                                                     register. See
1571                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1572     448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
1573                                                     SGPR user data registers
1574                                                     (see
1575                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1576
1577                                                     The total number of SGPR
1578                                                     user data registers
1579                                                     requested must not exceed
1580                                                     16 and match value in
1581                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1582                                                     Any requests beyond 16
1583                                                     will be ignored.
1584     449     1 bit   EnableSGPRDispatchPtr           *see above*
1585     450     1 bit   EnableSGPRQueuePtr              *see above*
1586     451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
1587     452     1 bit   EnableSGPRDispatchID            *see above*
1588     453     1 bit   EnableSGPRFlatScratchInit       *see above*
1589     454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
1590     455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
1591                                                     should always be 0.
1592     456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
1593                                                     should always be 0.
1594     457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
1595                                                     should always be 0.
1596     463:458 6 bits                                  Reserved, must be 0.
1597     511:464 6                                       Reserved, must be 0.
1598             bytes
1599     512     **Total size 64 bytes.**
1600     ======= ====================================================================
1601
1602..
1603
1604  .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1605     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1606
1607     ======= ======= =============================== ===========================================================================
1608     Bits    Size    Field Name                      Description
1609     ======= ======= =============================== ===========================================================================
1610     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
1611                                                     used by each work-item,
1612                                                     granularity is device
1613                                                     specific:
1614
1615                                                     GFX6-GFX9
1616                                                       - max_vgpr 1..256
1617                                                       - roundup((max_vgpg + 1)
1618                                                         / 4) - 1
1619
1620                                                     Used by CP to set up
1621                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
1622     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1623                                                     used by a wavefront,
1624                                                     granularity is device
1625                                                     specific:
1626
1627                                                     GFX6-GFX8
1628                                                       - max_sgpr 1..112
1629                                                       - roundup((max_sgpg + 1)
1630                                                         / 8) - 1
1631                                                     GFX9
1632                                                       - max_sgpr 1..112
1633                                                       - roundup((max_sgpg + 1)
1634                                                         / 16) - 1
1635
1636                                                     Includes the special SGPRs
1637                                                     for VCC, Flat Scratch (for
1638                                                     GFX7 onwards) and XNACK
1639                                                     (for GFX8 onwards). It does
1640                                                     not include the 16 SGPR
1641                                                     added if a trap handler is
1642                                                     enabled.
1643
1644                                                     Used by CP to set up
1645                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
1646     11:10   2 bits  PRIORITY                        Must be 0.
1647
1648                                                     Start executing wavefront
1649                                                     at the specified priority.
1650
1651                                                     CP is responsible for
1652                                                     filling in
1653                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
1654     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1655                                                     with specified rounding
1656                                                     mode for single (32
1657                                                     bit) floating point
1658                                                     precision floating point
1659                                                     operations.
1660
1661                                                     Floating point rounding
1662                                                     mode values are defined in
1663                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1664
1665                                                     Used by CP to set up
1666                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1667     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1668                                                     with specified rounding
1669                                                     denorm mode for half/double (16
1670                                                     and 64 bit) floating point
1671                                                     precision floating point
1672                                                     operations.
1673
1674                                                     Floating point rounding
1675                                                     mode values are defined in
1676                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1677
1678                                                     Used by CP to set up
1679                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1680     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1681                                                     with specified denorm mode
1682                                                     for single (32
1683                                                     bit)  floating point
1684                                                     precision floating point
1685                                                     operations.
1686
1687                                                     Floating point denorm mode
1688                                                     values are defined in
1689                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1690
1691                                                     Used by CP to set up
1692                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1693     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1694                                                     with specified denorm mode
1695                                                     for half/double (16
1696                                                     and 64 bit) floating point
1697                                                     precision floating point
1698                                                     operations.
1699
1700                                                     Floating point denorm mode
1701                                                     values are defined in
1702                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1703
1704                                                     Used by CP to set up
1705                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1706     20      1 bit   PRIV                            Must be 0.
1707
1708                                                     Start executing wavefront
1709                                                     in privilege trap handler
1710                                                     mode.
1711
1712                                                     CP is responsible for
1713                                                     filling in
1714                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
1715     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1716                                                     with DX10 clamp mode
1717                                                     enabled. Used by the vector
1718                                                     ALU to force DX10 style
1719                                                     treatment of NaN's (when
1720                                                     set, clamp NaN to zero,
1721                                                     otherwise pass NaN
1722                                                     through).
1723
1724                                                     Used by CP to set up
1725                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1726     22      1 bit   DEBUG_MODE                      Must be 0.
1727
1728                                                     Start executing wavefront
1729                                                     in single step mode.
1730
1731                                                     CP is responsible for
1732                                                     filling in
1733                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1734     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1735                                                     with IEEE mode
1736                                                     enabled. Floating point
1737                                                     opcodes that support
1738                                                     exception flag gathering
1739                                                     will quiet and propagate
1740                                                     signaling-NaN inputs per
1741                                                     IEEE 754-2008. Min_dx10 and
1742                                                     max_dx10 become IEEE
1743                                                     754-2008 compliant due to
1744                                                     signaling-NaN propagation
1745                                                     and quieting.
1746
1747                                                     Used by CP to set up
1748                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1749     24      1 bit   BULKY                           Must be 0.
1750
1751                                                     Only one work-group allowed
1752                                                     to execute on a compute
1753                                                     unit.
1754
1755                                                     CP is responsible for
1756                                                     filling in
1757                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
1758     25      1 bit   CDBG_USER                       Must be 0.
1759
1760                                                     Flag that can be used to
1761                                                     control debugging code.
1762
1763                                                     CP is responsible for
1764                                                     filling in
1765                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1766     26      1 bit   FP16_OVFL                       GFX6-GFX8
1767                                                       Reserved, must be 0.
1768                                                     GFX9
1769                                                       Wavefront starts execution
1770                                                       with specified fp16 overflow
1771                                                       mode.
1772
1773                                                       - If 0, fp16 overflow generates
1774                                                         +/-INF values.
1775                                                       - If 1, fp16 overflow that is the
1776                                                         result of an +/-INF input value
1777                                                         or divide by 0 produces a +/-INF,
1778                                                         otherwise clamps computed
1779                                                         overflow to +/-MAX_FP16 as
1780                                                         appropriate.
1781
1782                                                       Used by CP to set up
1783                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1784     31:27   5 bits                                  Reserved, must be 0.
1785     32      **Total size 4 bytes**
1786     ======= ===================================================================================================================
1787
1788..
1789
1790  .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1791     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1792
1793     ======= ======= =============================== ===========================================================================
1794     Bits    Size    Field Name                      Description
1795     ======= ======= =============================== ===========================================================================
1796     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1797                     _WAVE_OFFSET                    SGPR wave scratch offset
1798                                                     system register (see
1799                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1800
1801                                                     Used by CP to set up
1802                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1803     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1804                                                     user data registers
1805                                                     requested. This number must
1806                                                     match the number of user
1807                                                     data registers enabled.
1808
1809                                                     Used by CP to set up
1810                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1811     6       1 bit   ENABLE_TRAP_HANDLER             Set to 1 if code contains a
1812                                                     TRAP instruction which
1813                                                     requires a trap handler to
1814                                                     be enabled.
1815
1816                                                     CP sets
1817                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1818                                                     if the runtime has
1819                                                     installed a trap handler
1820                                                     regardless of the setting
1821                                                     of this field.
1822     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1823                                                     system SGPR register for
1824                                                     the work-group id in the X
1825                                                     dimension (see
1826                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1827
1828                                                     Used by CP to set up
1829                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1830     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1831                                                     system SGPR register for
1832                                                     the work-group id in the Y
1833                                                     dimension (see
1834                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1835
1836                                                     Used by CP to set up
1837                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1838     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1839                                                     system SGPR register for
1840                                                     the work-group id in the Z
1841                                                     dimension (see
1842                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1843
1844                                                     Used by CP to set up
1845                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1846     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1847                                                     system SGPR register for
1848                                                     work-group information (see
1849                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1850
1851                                                     Used by CP to set up
1852                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1853     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1854                                                     VGPR system registers used
1855                                                     for the work-item ID.
1856                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1857                                                     defines the values.
1858
1859                                                     Used by CP to set up
1860                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1861     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1862
1863                                                     Wavefront starts execution
1864                                                     with address watch
1865                                                     exceptions enabled which
1866                                                     are generated when L1 has
1867                                                     witnessed a thread access
1868                                                     an *address of
1869                                                     interest*.
1870
1871                                                     CP is responsible for
1872                                                     filling in the address
1873                                                     watch bit in
1874                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1875                                                     according to what the
1876                                                     runtime requests.
1877     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1878
1879                                                     Wavefront starts execution
1880                                                     with memory violation
1881                                                     exceptions exceptions
1882                                                     enabled which are generated
1883                                                     when a memory violation has
1884                                                     occurred for this wave from
1885                                                     L1 or LDS
1886                                                     (write-to-read-only-memory,
1887                                                     mis-aligned atomic, LDS
1888                                                     address out of range,
1889                                                     illegal address, etc.).
1890
1891                                                     CP sets the memory
1892                                                     violation bit in
1893                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1894                                                     according to what the
1895                                                     runtime requests.
1896     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
1897
1898                                                     CP uses the rounded value
1899                                                     from the dispatch packet,
1900                                                     not this value, as the
1901                                                     dispatch may contain
1902                                                     dynamically allocated group
1903                                                     segment memory. CP writes
1904                                                     directly to
1905                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1906
1907                                                     Amount of group segment
1908                                                     (LDS) to allocate for each
1909                                                     work-group. Granularity is
1910                                                     device specific:
1911
1912                                                     GFX6:
1913                                                       roundup(lds-size / (64 * 4))
1914                                                     GFX7-GFX9:
1915                                                       roundup(lds-size / (128 * 4))
1916
1917     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
1918                     _INVALID_OPERATION              with specified exceptions
1919                                                     enabled.
1920
1921                                                     Used by CP to set up
1922                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
1923                                                     (set from bits 0..6).
1924
1925                                                     IEEE 754 FP Invalid
1926                                                     Operation
1927     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
1928                     _SOURCE                         input operands is a
1929                                                     denormal number
1930     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
1931                     _DIVISION_BY_ZERO               Zero
1932     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
1933                     _OVERFLOW
1934     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
1935                     _UNDERFLOW
1936     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
1937                     _INEXACT
1938     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
1939                     _ZERO                           (rcp_iflag_f32 instruction
1940                                                     only)
1941     31      1 bit                                   Reserved, must be 0.
1942     32      **Total size 4 bytes.**
1943     ======= ===================================================================================================================
1944
1945..
1946
1947  .. table:: Floating Point Rounding Mode Enumeration Values
1948     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1949
1950     ====================================== ===== ==============================
1951     Enumeration Name                       Value Description
1952     ====================================== ===== ==============================
1953     AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
1954     AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
1955     AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
1956     AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
1957     ====================================== ===== ==============================
1958
1959..
1960
1961  .. table:: Floating Point Denorm Mode Enumeration Values
1962     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1963
1964     ====================================== ===== ==============================
1965     Enumeration Name                       Value Description
1966     ====================================== ===== ==============================
1967     AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
1968                                                  Denorms
1969     AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
1970     AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
1971     AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
1972     ====================================== ===== ==============================
1973
1974..
1975
1976  .. table:: System VGPR Work-Item ID Enumeration Values
1977     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1978
1979     ======================================== ===== ============================
1980     Enumeration Name                         Value Description
1981     ======================================== ===== ============================
1982     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension
1983                                                    ID.
1984     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1985                                                    dimensions ID.
1986     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1987                                                    dimensions ID.
1988     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1989     ======================================== ===== ============================
1990
1991.. _amdgpu-amdhsa-initial-kernel-execution-state:
1992
1993Initial Kernel Execution State
1994~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1995
1996This section defines the register state that will be set up by the packet
1997processor prior to the start of execution of every wavefront. This is limited by
1998the constraints of the hardware controllers of CP/ADC/SPI.
1999
2000The order of the SGPR registers is defined, but the compiler can specify which
2001ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2002fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2003for enabled registers are dense starting at SGPR0: the first enabled register is
2004SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2005an SGPR number.
2006
2007The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2008all waves of the grid. It is possible to specify more than 16 User SGPRs using
2009the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2010initialized. These are then immediately followed by the System SGPRs that are
2011set up by ADC/SPI and can have different values for each wave of the grid
2012dispatch.
2013
2014SGPR register initial state is defined in
2015:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2016
2017  .. table:: SGPR Register Set Up Order
2018     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2019
2020     ========== ========================== ====== ==============================
2021     SGPR Order Name                       Number Description
2022                (kernel descriptor enable  of
2023                field)                     SGPRs
2024     ========== ========================== ====== ==============================
2025     First      Private Segment Buffer     4      V# that can be used, together
2026                (enable_sgpr_private              with Scratch Wave Offset as an
2027                _segment_buffer)                  offset, to access the private
2028                                                  memory space using a segment
2029                                                  address.
2030
2031                                                  CP uses the value provided by
2032                                                  the runtime.
2033     then       Dispatch Ptr               2      64 bit address of AQL dispatch
2034                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2035                                                  actually executing.
2036     then       Queue Ptr                  2      64 bit address of amd_queue_t
2037                (enable_sgpr_queue_ptr)           object for AQL queue on which
2038                                                  the dispatch packet was
2039                                                  queued.
2040     then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2041                (enable_sgpr_kernarg              segment. This is directly
2042                _segment_ptr)                     copied from the
2043                                                  kernarg_address in the kernel
2044                                                  dispatch packet.
2045
2046                                                  Having CP load it once avoids
2047                                                  loading it at the beginning of
2048                                                  every wavefront.
2049     then       Dispatch Id                2      64 bit Dispatch ID of the
2050                (enable_sgpr_dispatch_id)         dispatch packet being
2051                                                  executed.
2052     then       Flat Scratch Init          2      This is 2 SGPRs:
2053                (enable_sgpr_flat_scratch
2054                _init)                            GFX6
2055                                                    Not supported.
2056                                                  GFX7-GFX8
2057                                                    The first SGPR is a 32 bit
2058                                                    byte offset from
2059                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2060                                                    to per SPI base of memory
2061                                                    for scratch for the queue
2062                                                    executing the kernel
2063                                                    dispatch. CP obtains this
2064                                                    from the runtime. (The
2065                                                    Scratch Segment Buffer base
2066                                                    address is
2067                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2068                                                    plus this offset.) The value
2069                                                    of Scratch Wave Offset must
2070                                                    be added to this offset by
2071                                                    the kernel machine code,
2072                                                    right shifted by 8, and
2073                                                    moved to the FLAT_SCRATCH_HI
2074                                                    SGPR register.
2075                                                    FLAT_SCRATCH_HI corresponds
2076                                                    to SGPRn-4 on GFX7, and
2077                                                    SGPRn-6 on GFX8 (where SGPRn
2078                                                    is the highest numbered SGPR
2079                                                    allocated to the wave).
2080                                                    FLAT_SCRATCH_HI is
2081                                                    multiplied by 256 (as it is
2082                                                    in units of 256 bytes) and
2083                                                    added to
2084                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2085                                                    to calculate the per wave
2086                                                    FLAT SCRATCH BASE in flat
2087                                                    memory instructions that
2088                                                    access the scratch
2089                                                    apperture.
2090
2091                                                    The second SGPR is 32 bit
2092                                                    byte size of a single
2093                                                    work-item's scratch memory
2094                                                    usage. CP obtains this from
2095                                                    the runtime, and it is
2096                                                    always a multiple of DWORD.
2097                                                    CP checks that the value in
2098                                                    the kernel dispatch packet
2099                                                    Private Segment Byte Size is
2100                                                    not larger, and requests the
2101                                                    runtime to increase the
2102                                                    queue's scratch size if
2103                                                    necessary. The kernel code
2104                                                    must move it to
2105                                                    FLAT_SCRATCH_LO which is
2106                                                    SGPRn-3 on GFX7 and SGPRn-5
2107                                                    on GFX8. FLAT_SCRATCH_LO is
2108                                                    used as the FLAT SCRATCH
2109                                                    SIZE in flat memory
2110                                                    instructions. Having CP load
2111                                                    it once avoids loading it at
2112                                                    the beginning of every
2113                                                    wavefront.
2114                                                  GFX9
2115                                                    This is the
2116                                                    64 bit base address of the
2117                                                    per SPI scratch backing
2118                                                    memory managed by SPI for
2119                                                    the queue executing the
2120                                                    kernel dispatch. CP obtains
2121                                                    this from the runtime (and
2122                                                    divides it if there are
2123                                                    multiple Shader Arrays each
2124                                                    with its own SPI). The value
2125                                                    of Scratch Wave Offset must
2126                                                    be added by the kernel
2127                                                    machine code and the result
2128                                                    moved to the FLAT_SCRATCH
2129                                                    SGPR which is SGPRn-6 and
2130                                                    SGPRn-5. It is used as the
2131                                                    FLAT SCRATCH BASE in flat
2132                                                    memory instructions.
2133     then       Private Segment Size       1      The 32 bit byte size of a
2134                                                  (enable_sgpr_private single
2135                                                  work-item's
2136                                                  scratch_segment_size) memory
2137                                                  allocation. This is the
2138                                                  value from the kernel
2139                                                  dispatch packet Private
2140                                                  Segment Byte Size rounded up
2141                                                  by CP to a multiple of
2142                                                  DWORD.
2143
2144                                                  Having CP load it once avoids
2145                                                  loading it at the beginning of
2146                                                  every wavefront.
2147
2148                                                  This is not used for
2149                                                  GFX7-GFX8 since it is the same
2150                                                  value as the second SGPR of
2151                                                  Flat Scratch Init. However, it
2152                                                  may be needed for GFX9 which
2153                                                  changes the meaning of the
2154                                                  Flat Scratch Init value.
2155     then       Grid Work-Group Count X    1      32 bit count of the number of
2156                (enable_sgpr_grid                 work-groups in the X dimension
2157                _workgroup_count_X)               for the grid being
2158                                                  executed. Computed from the
2159                                                  fields in the kernel dispatch
2160                                                  packet as ((grid_size.x +
2161                                                  workgroup_size.x - 1) /
2162                                                  workgroup_size.x).
2163     then       Grid Work-Group Count Y    1      32 bit count of the number of
2164                (enable_sgpr_grid                 work-groups in the Y dimension
2165                _workgroup_count_Y &&             for the grid being
2166                less than 16 previous             executed. Computed from the
2167                SGPRs)                            fields in the kernel dispatch
2168                                                  packet as ((grid_size.y +
2169                                                  workgroup_size.y - 1) /
2170                                                  workgroupSize.y).
2171
2172                                                  Only initialized if <16
2173                                                  previous SGPRs initialized.
2174     then       Grid Work-Group Count Z    1      32 bit count of the number of
2175                (enable_sgpr_grid                 work-groups in the Z dimension
2176                _workgroup_count_Z &&             for the grid being
2177                less than 16 previous             executed. Computed from the
2178                SGPRs)                            fields in the kernel dispatch
2179                                                  packet as ((grid_size.z +
2180                                                  workgroup_size.z - 1) /
2181                                                  workgroupSize.z).
2182
2183                                                  Only initialized if <16
2184                                                  previous SGPRs initialized.
2185     then       Work-Group Id X            1      32 bit work-group id in X
2186                (enable_sgpr_workgroup_id         dimension of grid for
2187                _X)                               wavefront.
2188     then       Work-Group Id Y            1      32 bit work-group id in Y
2189                (enable_sgpr_workgroup_id         dimension of grid for
2190                _Y)                               wavefront.
2191     then       Work-Group Id Z            1      32 bit work-group id in Z
2192                (enable_sgpr_workgroup_id         dimension of grid for
2193                _Z)                               wavefront.
2194     then       Work-Group Info            1      {first_wave, 14'b0000,
2195                (enable_sgpr_workgroup            ordered_append_term[10:0],
2196                _info)                            threadgroup_size_in_waves[5:0]}
2197     then       Scratch Wave Offset        1      32 bit byte offset from base
2198                (enable_sgpr_private              of scratch base of queue
2199                _segment_wave_offset)             executing the kernel
2200                                                  dispatch. Must be used as an
2201                                                  offset with Private
2202                                                  segment address when using
2203                                                  Scratch Segment Buffer. It
2204                                                  must be used to set up FLAT
2205                                                  SCRATCH for flat addressing
2206                                                  (see
2207                                                  :ref:`amdgpu-amdhsa-flat-scratch`).
2208     ========== ========================== ====== ==============================
2209
2210The order of the VGPR registers is defined, but the compiler can specify which
2211ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2212fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2213for enabled registers are dense starting at VGPR0: the first enabled register is
2214VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2215VGPR number.
2216
2217VGPR register initial state is defined in
2218:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2219
2220  .. table:: VGPR Register Set Up Order
2221     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2222
2223     ========== ========================== ====== ==============================
2224     VGPR Order Name                       Number Description
2225                (kernel descriptor enable  of
2226                field)                     VGPRs
2227     ========== ========================== ====== ==============================
2228     First      Work-Item Id X             1      32 bit work item id in X
2229                (Always initialized)              dimension of work-group for
2230                                                  wavefront lane.
2231     then       Work-Item Id Y             1      32 bit work item id in Y
2232                (enable_vgpr_workitem_id          dimension of work-group for
2233                > 0)                              wavefront lane.
2234     then       Work-Item Id Z             1      32 bit work item id in Z
2235                (enable_vgpr_workitem_id          dimension of work-group for
2236                > 1)                              wavefront lane.
2237     ========== ========================== ====== ==============================
2238
2239The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2240
22411. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2242   registers.
22432. Work-group Id registers X, Y, Z are set by ADC which supports any
2244   combination including none.
22453. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2246   cannot included with the flat scratch init value which is per queue.
22474. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2248   or (X, Y, Z).
2249
2250Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2251value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2252
2253The global segment can be accessed either using buffer instructions (GFX6 which
2254has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2255instructions (GFX9).
2256
2257If buffer operations are used then the compiler can generate a V# with the
2258following properties:
2259
2260* base address of 0
2261* no swizzle
2262* ATC: 1 if IOMMU present (such as APU)
2263* ptr64: 1
2264* MTYPE set to support memory coherence that matches the runtime (such as CC for
2265  APU and NC for dGPU).
2266
2267.. _amdgpu-amdhsa-kernel-prolog:
2268
2269Kernel Prolog
2270~~~~~~~~~~~~~
2271
2272.. _amdgpu-amdhsa-m0:
2273
2274M0
2275++
2276
2277GFX6-GFX8
2278  The M0 register must be initialized with a value at least the total LDS size
2279  if the kernel may access LDS via DS or flat operations. Total LDS size is
2280  available in dispatch packet. For M0, it is also possible to use maximum
2281  possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2282  GFX7-GFX8).
2283GFX9
2284  The M0 register is not used for range checking LDS accesses and so does not
2285  need to be initialized in the prolog.
2286
2287.. _amdgpu-amdhsa-flat-scratch:
2288
2289Flat Scratch
2290++++++++++++
2291
2292If the kernel may use flat operations to access scratch memory, the prolog code
2293must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2294are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2295Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2296
2297GFX6
2298  Flat scratch is not supported.
2299
2300GFX7-GFX8
2301  1. The low word of Flat Scratch Init is 32 bit byte offset from
2302     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2303     being managed by SPI for the queue executing the kernel dispatch. This is
2304     the same value used in the Scratch Segment Buffer V# base address. The
2305     prolog must add the value of Scratch Wave Offset to get the wave's byte
2306     scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2307     FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2308     by 8 before moving into FLAT_SCRATCH_LO.
2309  2. The second word of Flat Scratch Init is 32 bit byte size of a single
2310     work-items scratch memory usage. This is directly loaded from the kernel
2311     dispatch packet Private Segment Byte Size and rounded up to a multiple of
2312     DWORD. Having CP load it once avoids loading it at the beginning of every
2313     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2314     SIZE.
2315
2316GFX9
2317  The Flat Scratch Init is the 64 bit address of the base of scratch backing
2318  memory being managed by SPI for the queue executing the kernel dispatch. The
2319  prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2320  pair for use as the flat scratch base in flat memory instructions.
2321
2322.. _amdgpu-amdhsa-memory-model:
2323
2324Memory Model
2325~~~~~~~~~~~~
2326
2327This section describes the mapping of LLVM memory model onto AMDGPU machine code
2328(see :ref:`memmodel`). *The implementation is WIP.*
2329
2330.. TODO
2331   Update when implementation complete.
2332
2333The AMDGPU backend supports the memory synchronization scopes specified in
2334:ref:`amdgpu-memory-scopes`.
2335
2336The code sequences used to implement the memory model are defined in table
2337:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2338
2339The sequences specify the order of instructions that a single thread must
2340execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2341to other memory instructions executed by the same thread. This allows them to be
2342moved earlier or later which can allow them to be combined with other instances
2343of the same instruction, or hoisted/sunk out of loops to improve
2344performance. Only the instructions related to the memory model are given;
2345additional ``s_waitcnt`` instructions are required to ensure registers are
2346defined before being used. These may be able to be combined with the memory
2347model ``s_waitcnt`` instructions as described above.
2348
2349The AMDGPU backend supports the following memory models:
2350
2351  HSA Memory Model [HSA]_
2352    The HSA memory model uses a single happens-before relation for all address
2353    spaces (see :ref:`amdgpu-address-spaces`).
2354  OpenCL Memory Model [OpenCL]_
2355    The OpenCL memory model which has separate happens-before relations for the
2356    global and local address spaces. Only a fence specifying both global and
2357    local address space, and seq_cst instructions join the relationships. Since
2358    the LLVM ``memfence`` instruction does not allow an address space to be
2359    specified the OpenCL fence has to convervatively assume both local and
2360    global address space was specified. However, optimizations can often be
2361    done to eliminate the additional ``s_waitcnt`` instructions when there are
2362    no intervening memory instructions which access the corresponding address
2363    space. The code sequences in the table indicate what can be omitted for the
2364    OpenCL memory. The target triple environment is used to determine if the
2365    source language is OpenCL (see :ref:`amdgpu-opencl`).
2366
2367``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2368operations.
2369
2370``buffer/global/flat_load/store/atomic`` instructions to global memory are
2371termed vector memory operations.
2372
2373For GFX6-GFX9:
2374
2375* Each agent has multiple compute units (CU).
2376* Each CU has multiple SIMDs that execute wavefronts.
2377* The wavefronts for a single work-group are executed in the same CU but may be
2378  executed by different SIMDs.
2379* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2380  executing on it.
2381* All LDS operations of a CU are performed as wavefront wide operations in a
2382  global order and involve no caching. Completion is reported to a wavefront in
2383  execution order.
2384* The LDS memory has multiple request queues shared by the SIMDs of a
2385  CU. Therefore, the LDS operations performed by different waves of a work-group
2386  can be reordered relative to each other, which can result in reordering the
2387  visibility of vector memory operations with respect to LDS operations of other
2388  wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2389  ensure synchronization between LDS operations and vector memory operations
2390  between waves of a work-group, but not between operations performed by the
2391  same wavefront.
2392* The vector memory operations are performed as wavefront wide operations and
2393  completion is reported to a wavefront in execution order. The exception is
2394  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2395  vector memory order if they access LDS memory, and out of LDS operation order
2396  if they access global memory.
2397* The vector memory operations access a single vector L1 cache shared by all
2398  SIMDs a CU. Therefore, no special action is required for coherence between the
2399  lanes of a single wavefront, or for coherence between wavefronts in the same
2400  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2401  executing in different work-groups as they may be executing on different CUs.
2402* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2403  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2404  scalar operations are used in a restricted way so do not impact the memory
2405  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2406* The vector and scalar memory operations use an L2 cache shared by all CUs on
2407  the same agent.
2408* The L2 cache has independent channels to service disjoint ranges of virtual
2409  addresses.
2410* Each CU has a separate request queue per channel. Therefore, the vector and
2411  scalar memory operations performed by waves executing in different work-groups
2412  (which may be executing on different CUs) of an agent can be reordered
2413  relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2414  synchronization between vector memory operations of different CUs. It ensures a
2415  previous vector memory operation has completed before executing a subsequent
2416  vector memory or LDS operation and so can be used to meet the requirements of
2417  acquire and release.
2418* The L2 cache can be kept coherent with other agents on some targets, or ranges
2419  of virtual addresses can be set up to bypass it to ensure system coherence.
2420
2421Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2422or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2423memory, atomic memory orderings are not meaningful and all accesses are treated
2424as non-atomic.
2425
2426Constant address space uses ``buffer/global_load`` instructions (or equivalent
2427scalar memory instructions). Since the constant address space contents do not
2428change during the execution of a kernel dispatch it is not legal to perform
2429stores, and atomic memory orderings are not meaningful and all access are
2430treated as non-atomic.
2431
2432A memory synchronization scope wider than work-group is not meaningful for the
2433group (LDS) address space and is treated as work-group.
2434
2435The memory model does not support the region address space which is treated as
2436non-atomic.
2437
2438Acquire memory ordering is not meaningful on store atomic instructions and is
2439treated as non-atomic.
2440
2441Release memory ordering is not meaningful on load atomic instructions and is
2442treated a non-atomic.
2443
2444Acquire-release memory ordering is not meaningful on load or store atomic
2445instructions and is treated as acquire and release respectively.
2446
2447AMDGPU backend only uses scalar memory operations to access memory that is
2448proven to not change during the execution of the kernel dispatch. This includes
2449constant address space and global address space for program scope const
2450variables. Therefore the kernel machine code does not have to maintain the
2451scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2452and vector L1 caches are invalidated between kernel dispatches by CP since
2453constant address space data may change between kernel dispatch executions. See
2454:ref:`amdgpu-amdhsa-memory-spaces`.
2455
2456The one execption is if scalar writes are used to spill SGPR registers. In this
2457case the AMDGPU backend ensures the memory location used to spill is never
2458accessed by vector memory operations at the same time. If scalar writes are used
2459then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2460return since the locations may be used for vector memory instructions by a
2461future wave that uses the same scratch area, or a function call that creates a
2462frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2463as all scalar writes are write-before-read in the same thread.
2464
2465Scratch backing memory (which is used for the private address space)
2466is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2467address space is only accessed by a single thread, and is always
2468write-before-read, there is never a need to invalidate these entries from the L1
2469cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2470volatile cache lines.
2471
2472On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2473to invalidate the L2 cache. This also causes it to be treated as
2474non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2475(cache coherent) and so the L2 cache will coherent with the CPU and other
2476agents.
2477
2478  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2479     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2480
2481     ============ ============ ============== ========== ===============================
2482     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2483                  Ordering     Sync Scope     Address
2484                                              Space
2485     ============ ============ ============== ========== ===============================
2486     **Non-Atomic**
2487     -----------------------------------------------------------------------------------
2488     load         *none*       *none*         - global   - !volatile & !nontemporal
2489                                              - generic
2490                                              - private    1. buffer/global/flat_load
2491                                              - constant
2492                                                         - volatile & !nontemporal
2493
2494                                                           1. buffer/global/flat_load
2495                                                              glc=1
2496
2497                                                         - nontemporal
2498
2499                                                           1. buffer/global/flat_load
2500                                                              glc=1 slc=1
2501
2502     load         *none*       *none*         - local    1. ds_load
2503     store        *none*       *none*         - global   - !nontemporal
2504                                              - generic
2505                                              - private    1. buffer/global/flat_store
2506                                              - constant
2507                                                         - nontemporal
2508
2509                                                           1. buffer/global/flat_stote
2510                                                              glc=1 slc=1
2511
2512     store        *none*       *none*         - local    1. ds_store
2513     **Unordered Atomic**
2514     -----------------------------------------------------------------------------------
2515     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2516     store atomic unordered    *any*          *any*      *Same as non-atomic*.
2517     atomicrmw    unordered    *any*          *any*      *Same as monotonic
2518                                                         atomic*.
2519     **Monotonic Atomic**
2520     -----------------------------------------------------------------------------------
2521     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2522                               - wavefront    - generic
2523                               - workgroup
2524     load atomic  monotonic    - singlethread - local    1. ds_load
2525                               - wavefront
2526                               - workgroup
2527     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2528                               - system       - generic     glc=1
2529     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2530                               - wavefront    - generic
2531                               - workgroup
2532                               - agent
2533                               - system
2534     store atomic monotonic    - singlethread - local    1. ds_store
2535                               - wavefront
2536                               - workgroup
2537     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2538                               - wavefront    - generic
2539                               - workgroup
2540                               - agent
2541                               - system
2542     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2543                               - wavefront
2544                               - workgroup
2545     **Acquire Atomic**
2546     -----------------------------------------------------------------------------------
2547     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2548                               - wavefront    - local
2549                                              - generic
2550     load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2551     load atomic  acquire      - workgroup    - local    1. ds_load
2552                                                         2. s_waitcnt lgkmcnt(0)
2553
2554                                                           - If OpenCL, omit.
2555                                                           - Must happen before
2556                                                             any following
2557                                                             global/generic
2558                                                             load/load
2559                                                             atomic/store/store
2560                                                             atomic/atomicrmw.
2561                                                           - Ensures any
2562                                                             following global
2563                                                             data read is no
2564                                                             older than the load
2565                                                             atomic value being
2566                                                             acquired.
2567     load atomic  acquire      - workgroup    - generic  1. flat_load
2568                                                         2. s_waitcnt lgkmcnt(0)
2569
2570                                                           - If OpenCL, omit.
2571                                                           - Must happen before
2572                                                             any following
2573                                                             global/generic
2574                                                             load/load
2575                                                             atomic/store/store
2576                                                             atomic/atomicrmw.
2577                                                           - Ensures any
2578                                                             following global
2579                                                             data read is no
2580                                                             older than the load
2581                                                             atomic value being
2582                                                             acquired.
2583     load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2584                               - system                     glc=1
2585                                                         2. s_waitcnt vmcnt(0)
2586
2587                                                           - Must happen before
2588                                                             following
2589                                                             buffer_wbinvl1_vol.
2590                                                           - Ensures the load
2591                                                             has completed
2592                                                             before invalidating
2593                                                             the cache.
2594
2595                                                         3. buffer_wbinvl1_vol
2596
2597                                                           - Must happen before
2598                                                             any following
2599                                                             global/generic
2600                                                             load/load
2601                                                             atomic/atomicrmw.
2602                                                           - Ensures that
2603                                                             following
2604                                                             loads will not see
2605                                                             stale global data.
2606
2607     load atomic  acquire      - agent        - generic  1. flat_load glc=1
2608                               - system                  2. s_waitcnt vmcnt(0) &
2609                                                            lgkmcnt(0)
2610
2611                                                           - If OpenCL omit
2612                                                             lgkmcnt(0).
2613                                                           - Must happen before
2614                                                             following
2615                                                             buffer_wbinvl1_vol.
2616                                                           - Ensures the flat_load
2617                                                             has completed
2618                                                             before invalidating
2619                                                             the cache.
2620
2621                                                         3. buffer_wbinvl1_vol
2622
2623                                                           - Must happen before
2624                                                             any following
2625                                                             global/generic
2626                                                             load/load
2627                                                             atomic/atomicrmw.
2628                                                           - Ensures that
2629                                                             following loads
2630                                                             will not see stale
2631                                                             global data.
2632
2633     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2634                               - wavefront    - local
2635                                              - generic
2636     atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2637     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2638                                                         2. waitcnt lgkmcnt(0)
2639
2640                                                           - If OpenCL, omit.
2641                                                           - Must happen before
2642                                                             any following
2643                                                             global/generic
2644                                                             load/load
2645                                                             atomic/store/store
2646                                                             atomic/atomicrmw.
2647                                                           - Ensures any
2648                                                             following global
2649                                                             data read is no
2650                                                             older than the
2651                                                             atomicrmw value
2652                                                             being acquired.
2653
2654     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2655                                                         2. waitcnt lgkmcnt(0)
2656
2657                                                           - If OpenCL, omit.
2658                                                           - Must happen before
2659                                                             any following
2660                                                             global/generic
2661                                                             load/load
2662                                                             atomic/store/store
2663                                                             atomic/atomicrmw.
2664                                                           - Ensures any
2665                                                             following global
2666                                                             data read is no
2667                                                             older than the
2668                                                             atomicrmw value
2669                                                             being acquired.
2670
2671     atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2672                               - system                  2. s_waitcnt vmcnt(0)
2673
2674                                                           - Must happen before
2675                                                             following
2676                                                             buffer_wbinvl1_vol.
2677                                                           - Ensures the
2678                                                             atomicrmw has
2679                                                             completed before
2680                                                             invalidating the
2681                                                             cache.
2682
2683                                                         3. buffer_wbinvl1_vol
2684
2685                                                           - Must happen before
2686                                                             any following
2687                                                             global/generic
2688                                                             load/load
2689                                                             atomic/atomicrmw.
2690                                                           - Ensures that
2691                                                             following loads
2692                                                             will not see stale
2693                                                             global data.
2694
2695     atomicrmw    acquire      - agent        - generic  1. flat_atomic
2696                               - system                  2. s_waitcnt vmcnt(0) &
2697                                                            lgkmcnt(0)
2698
2699                                                           - If OpenCL, omit
2700                                                             lgkmcnt(0).
2701                                                           - Must happen before
2702                                                             following
2703                                                             buffer_wbinvl1_vol.
2704                                                           - Ensures the
2705                                                             atomicrmw has
2706                                                             completed before
2707                                                             invalidating the
2708                                                             cache.
2709
2710                                                         3. buffer_wbinvl1_vol
2711
2712                                                           - Must happen before
2713                                                             any following
2714                                                             global/generic
2715                                                             load/load
2716                                                             atomic/atomicrmw.
2717                                                           - Ensures that
2718                                                             following loads
2719                                                             will not see stale
2720                                                             global data.
2721
2722     fence        acquire      - singlethread *none*     *none*
2723                               - wavefront
2724     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2725
2726                                                           - If OpenCL and
2727                                                             address space is
2728                                                             not generic, omit.
2729                                                           - However, since LLVM
2730                                                             currently has no
2731                                                             address space on
2732                                                             the fence need to
2733                                                             conservatively
2734                                                             always generate. If
2735                                                             fence had an
2736                                                             address space then
2737                                                             set to address
2738                                                             space of OpenCL
2739                                                             fence flag, or to
2740                                                             generic if both
2741                                                             local and global
2742                                                             flags are
2743                                                             specified.
2744                                                           - Must happen after
2745                                                             any preceding
2746                                                             local/generic load
2747                                                             atomic/atomicrmw
2748                                                             with an equal or
2749                                                             wider sync scope
2750                                                             and memory ordering
2751                                                             stronger than
2752                                                             unordered (this is
2753                                                             termed the
2754                                                             fence-paired-atomic).
2755                                                           - Must happen before
2756                                                             any following
2757                                                             global/generic
2758                                                             load/load
2759                                                             atomic/store/store
2760                                                             atomic/atomicrmw.
2761                                                           - Ensures any
2762                                                             following global
2763                                                             data read is no
2764                                                             older than the
2765                                                             value read by the
2766                                                             fence-paired-atomic.
2767
2768     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2769                               - system                     vmcnt(0)
2770
2771                                                           - If OpenCL and
2772                                                             address space is
2773                                                             not generic, omit
2774                                                             lgkmcnt(0).
2775                                                           - However, since LLVM
2776                                                             currently has no
2777                                                             address space on
2778                                                             the fence need to
2779                                                             conservatively
2780                                                             always generate
2781                                                             (see comment for
2782                                                             previous fence).
2783                                                           - Could be split into
2784                                                             separate s_waitcnt
2785                                                             vmcnt(0) and
2786                                                             s_waitcnt
2787                                                             lgkmcnt(0) to allow
2788                                                             them to be
2789                                                             independently moved
2790                                                             according to the
2791                                                             following rules.
2792                                                           - s_waitcnt vmcnt(0)
2793                                                             must happen after
2794                                                             any preceding
2795                                                             global/generic load
2796                                                             atomic/atomicrmw
2797                                                             with an equal or
2798                                                             wider sync scope
2799                                                             and memory ordering
2800                                                             stronger than
2801                                                             unordered (this is
2802                                                             termed the
2803                                                             fence-paired-atomic).
2804                                                           - s_waitcnt lgkmcnt(0)
2805                                                             must happen after
2806                                                             any preceding
2807                                                             local/generic load
2808                                                             atomic/atomicrmw
2809                                                             with an equal or
2810                                                             wider sync scope
2811                                                             and memory ordering
2812                                                             stronger than
2813                                                             unordered (this is
2814                                                             termed the
2815                                                             fence-paired-atomic).
2816                                                           - Must happen before
2817                                                             the following
2818                                                             buffer_wbinvl1_vol.
2819                                                           - Ensures that the
2820                                                             fence-paired atomic
2821                                                             has completed
2822                                                             before invalidating
2823                                                             the
2824                                                             cache. Therefore
2825                                                             any following
2826                                                             locations read must
2827                                                             be no older than
2828                                                             the value read by
2829                                                             the
2830                                                             fence-paired-atomic.
2831
2832                                                         2. buffer_wbinvl1_vol
2833
2834                                                           - Must happen before any
2835                                                             following global/generic
2836                                                             load/load
2837                                                             atomic/store/store
2838                                                             atomic/atomicrmw.
2839                                                           - Ensures that
2840                                                             following loads
2841                                                             will not see stale
2842                                                             global data.
2843
2844     **Release Atomic**
2845     -----------------------------------------------------------------------------------
2846     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2847                               - wavefront    - local
2848                                              - generic
2849     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2850
2851                                                           - If OpenCL, omit.
2852                                                           - Must happen after
2853                                                             any preceding
2854                                                             local/generic
2855                                                             load/store/load
2856                                                             atomic/store
2857                                                             atomic/atomicrmw.
2858                                                           - Must happen before
2859                                                             the following
2860                                                             store.
2861                                                           - Ensures that all
2862                                                             memory operations
2863                                                             to local have
2864                                                             completed before
2865                                                             performing the
2866                                                             store that is being
2867                                                             released.
2868
2869                                                         2. buffer/global/flat_store
2870     store atomic release      - workgroup    - local    1. ds_store
2871     store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2872
2873                                                           - If OpenCL, omit.
2874                                                           - Must happen after
2875                                                             any preceding
2876                                                             local/generic
2877                                                             load/store/load
2878                                                             atomic/store
2879                                                             atomic/atomicrmw.
2880                                                           - Must happen before
2881                                                             the following
2882                                                             store.
2883                                                           - Ensures that all
2884                                                             memory operations
2885                                                             to local have
2886                                                             completed before
2887                                                             performing the
2888                                                             store that is being
2889                                                             released.
2890
2891                                                         2. flat_store
2892     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2893                               - system       - generic     vmcnt(0)
2894
2895                                                           - If OpenCL, omit
2896                                                             lgkmcnt(0).
2897                                                           - Could be split into
2898                                                             separate s_waitcnt
2899                                                             vmcnt(0) and
2900                                                             s_waitcnt
2901                                                             lgkmcnt(0) to allow
2902                                                             them to be
2903                                                             independently moved
2904                                                             according to the
2905                                                             following rules.
2906                                                           - s_waitcnt vmcnt(0)
2907                                                             must happen after
2908                                                             any preceding
2909                                                             global/generic
2910                                                             load/store/load
2911                                                             atomic/store
2912                                                             atomic/atomicrmw.
2913                                                           - s_waitcnt lgkmcnt(0)
2914                                                             must happen after
2915                                                             any preceding
2916                                                             local/generic
2917                                                             load/store/load
2918                                                             atomic/store
2919                                                             atomic/atomicrmw.
2920                                                           - Must happen before
2921                                                             the following
2922                                                             store.
2923                                                           - Ensures that all
2924                                                             memory operations
2925                                                             to memory have
2926                                                             completed before
2927                                                             performing the
2928                                                             store that is being
2929                                                             released.
2930
2931                                                         2. buffer/global/ds/flat_store
2932     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2933                               - wavefront    - local
2934                                              - generic
2935     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2936
2937                                                           - If OpenCL, omit.
2938                                                           - Must happen after
2939                                                             any preceding
2940                                                             local/generic
2941                                                             load/store/load
2942                                                             atomic/store
2943                                                             atomic/atomicrmw.
2944                                                           - Must happen before
2945                                                             the following
2946                                                             atomicrmw.
2947                                                           - Ensures that all
2948                                                             memory operations
2949                                                             to local have
2950                                                             completed before
2951                                                             performing the
2952                                                             atomicrmw that is
2953                                                             being released.
2954
2955                                                         2. buffer/global/flat_atomic
2956     atomicrmw    release      - workgroup    - local    1. ds_atomic
2957     atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2958
2959                                                           - If OpenCL, omit.
2960                                                           - Must happen after
2961                                                             any preceding
2962                                                             local/generic
2963                                                             load/store/load
2964                                                             atomic/store
2965                                                             atomic/atomicrmw.
2966                                                           - Must happen before
2967                                                             the following
2968                                                             atomicrmw.
2969                                                           - Ensures that all
2970                                                             memory operations
2971                                                             to local have
2972                                                             completed before
2973                                                             performing the
2974                                                             atomicrmw that is
2975                                                             being released.
2976
2977                                                         2. flat_atomic
2978     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2979                               - system       - generic     vmcnt(0)
2980
2981                                                           - If OpenCL, omit
2982                                                             lgkmcnt(0).
2983                                                           - Could be split into
2984                                                             separate s_waitcnt
2985                                                             vmcnt(0) and
2986                                                             s_waitcnt
2987                                                             lgkmcnt(0) to allow
2988                                                             them to be
2989                                                             independently moved
2990                                                             according to the
2991                                                             following rules.
2992                                                           - s_waitcnt vmcnt(0)
2993                                                             must happen after
2994                                                             any preceding
2995                                                             global/generic
2996                                                             load/store/load
2997                                                             atomic/store
2998                                                             atomic/atomicrmw.
2999                                                           - s_waitcnt lgkmcnt(0)
3000                                                             must happen after
3001                                                             any preceding
3002                                                             local/generic
3003                                                             load/store/load
3004                                                             atomic/store
3005                                                             atomic/atomicrmw.
3006                                                           - Must happen before
3007                                                             the following
3008                                                             atomicrmw.
3009                                                           - Ensures that all
3010                                                             memory operations
3011                                                             to global and local
3012                                                             have completed
3013                                                             before performing
3014                                                             the atomicrmw that
3015                                                             is being released.
3016
3017                                                         2. buffer/global/ds/flat_atomic
3018     fence        release      - singlethread *none*     *none*
3019                               - wavefront
3020     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3021
3022                                                           - If OpenCL and
3023                                                             address space is
3024                                                             not generic, omit.
3025                                                           - However, since LLVM
3026                                                             currently has no
3027                                                             address space on
3028                                                             the fence need to
3029                                                             conservatively
3030                                                             always generate. If
3031                                                             fence had an
3032                                                             address space then
3033                                                             set to address
3034                                                             space of OpenCL
3035                                                             fence flag, or to
3036                                                             generic if both
3037                                                             local and global
3038                                                             flags are
3039                                                             specified.
3040                                                           - Must happen after
3041                                                             any preceding
3042                                                             local/generic
3043                                                             load/load
3044                                                             atomic/store/store
3045                                                             atomic/atomicrmw.
3046                                                           - Must happen before
3047                                                             any following store
3048                                                             atomic/atomicrmw
3049                                                             with an equal or
3050                                                             wider sync scope
3051                                                             and memory ordering
3052                                                             stronger than
3053                                                             unordered (this is
3054                                                             termed the
3055                                                             fence-paired-atomic).
3056                                                           - Ensures that all
3057                                                             memory operations
3058                                                             to local have
3059                                                             completed before
3060                                                             performing the
3061                                                             following
3062                                                             fence-paired-atomic.
3063
3064     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3065                               - system                     vmcnt(0)
3066
3067                                                           - If OpenCL and
3068                                                             address space is
3069                                                             not generic, omit
3070                                                             lgkmcnt(0).
3071                                                           - If OpenCL and
3072                                                             address space is
3073                                                             local, omit
3074                                                             vmcnt(0).
3075                                                           - However, since LLVM
3076                                                             currently has no
3077                                                             address space on
3078                                                             the fence need to
3079                                                             conservatively
3080                                                             always generate. If
3081                                                             fence had an
3082                                                             address space then
3083                                                             set to address
3084                                                             space of OpenCL
3085                                                             fence flag, or to
3086                                                             generic if both
3087                                                             local and global
3088                                                             flags are
3089                                                             specified.
3090                                                           - Could be split into
3091                                                             separate s_waitcnt
3092                                                             vmcnt(0) and
3093                                                             s_waitcnt
3094                                                             lgkmcnt(0) to allow
3095                                                             them to be
3096                                                             independently moved
3097                                                             according to the
3098                                                             following rules.
3099                                                           - s_waitcnt vmcnt(0)
3100                                                             must happen after
3101                                                             any preceding
3102                                                             global/generic
3103                                                             load/store/load
3104                                                             atomic/store
3105                                                             atomic/atomicrmw.
3106                                                           - s_waitcnt lgkmcnt(0)
3107                                                             must happen after
3108                                                             any preceding
3109                                                             local/generic
3110                                                             load/store/load
3111                                                             atomic/store
3112                                                             atomic/atomicrmw.
3113                                                           - Must happen before
3114                                                             any following store
3115                                                             atomic/atomicrmw
3116                                                             with an equal or
3117                                                             wider sync scope
3118                                                             and memory ordering
3119                                                             stronger than
3120                                                             unordered (this is
3121                                                             termed the
3122                                                             fence-paired-atomic).
3123                                                           - Ensures that all
3124                                                             memory operations
3125                                                             have
3126                                                             completed before
3127                                                             performing the
3128                                                             following
3129                                                             fence-paired-atomic.
3130
3131     **Acquire-Release Atomic**
3132     -----------------------------------------------------------------------------------
3133     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3134                               - wavefront    - local
3135                                              - generic
3136     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3137
3138                                                           - If OpenCL, omit.
3139                                                           - Must happen after
3140                                                             any preceding
3141                                                             local/generic
3142                                                             load/store/load
3143                                                             atomic/store
3144                                                             atomic/atomicrmw.
3145                                                           - Must happen before
3146                                                             the following
3147                                                             atomicrmw.
3148                                                           - Ensures that all
3149                                                             memory operations
3150                                                             to local have
3151                                                             completed before
3152                                                             performing the
3153                                                             atomicrmw that is
3154                                                             being released.
3155
3156                                                         2. buffer/global/flat_atomic
3157     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3158                                                         2. s_waitcnt lgkmcnt(0)
3159
3160                                                           - If OpenCL, omit.
3161                                                           - Must happen before
3162                                                             any following
3163                                                             global/generic
3164                                                             load/load
3165                                                             atomic/store/store
3166                                                             atomic/atomicrmw.
3167                                                           - Ensures any
3168                                                             following global
3169                                                             data read is no
3170                                                             older than the load
3171                                                             atomic value being
3172                                                             acquired.
3173
3174     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3175
3176                                                           - If OpenCL, omit.
3177                                                           - Must happen after
3178                                                             any preceding
3179                                                             local/generic
3180                                                             load/store/load
3181                                                             atomic/store
3182                                                             atomic/atomicrmw.
3183                                                           - Must happen before
3184                                                             the following
3185                                                             atomicrmw.
3186                                                           - Ensures that all
3187                                                             memory operations
3188                                                             to local have
3189                                                             completed before
3190                                                             performing the
3191                                                             atomicrmw that is
3192                                                             being released.
3193
3194                                                         2. flat_atomic
3195                                                         3. s_waitcnt lgkmcnt(0)
3196
3197                                                           - If OpenCL, omit.
3198                                                           - Must happen before
3199                                                             any following
3200                                                             global/generic
3201                                                             load/load
3202                                                             atomic/store/store
3203                                                             atomic/atomicrmw.
3204                                                           - Ensures any
3205                                                             following global
3206                                                             data read is no
3207                                                             older than the load
3208                                                             atomic value being
3209                                                             acquired.
3210
3211     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3212                               - system                     vmcnt(0)
3213
3214                                                           - If OpenCL, omit
3215                                                             lgkmcnt(0).
3216                                                           - Could be split into
3217                                                             separate s_waitcnt
3218                                                             vmcnt(0) and
3219                                                             s_waitcnt
3220                                                             lgkmcnt(0) to allow
3221                                                             them to be
3222                                                             independently moved
3223                                                             according to the
3224                                                             following rules.
3225                                                           - s_waitcnt vmcnt(0)
3226                                                             must happen after
3227                                                             any preceding
3228                                                             global/generic
3229                                                             load/store/load
3230                                                             atomic/store
3231                                                             atomic/atomicrmw.
3232                                                           - s_waitcnt lgkmcnt(0)
3233                                                             must happen after
3234                                                             any preceding
3235                                                             local/generic
3236                                                             load/store/load
3237                                                             atomic/store
3238                                                             atomic/atomicrmw.
3239                                                           - Must happen before
3240                                                             the following
3241                                                             atomicrmw.
3242                                                           - Ensures that all
3243                                                             memory operations
3244                                                             to global have
3245                                                             completed before
3246                                                             performing the
3247                                                             atomicrmw that is
3248                                                             being released.
3249
3250                                                         2. buffer/global/flat_atomic
3251                                                         3. s_waitcnt vmcnt(0)
3252
3253                                                           - Must happen before
3254                                                             following
3255                                                             buffer_wbinvl1_vol.
3256                                                           - Ensures the
3257                                                             atomicrmw has
3258                                                             completed before
3259                                                             invalidating the
3260                                                             cache.
3261
3262                                                         4. buffer_wbinvl1_vol
3263
3264                                                           - Must happen before
3265                                                             any following
3266                                                             global/generic
3267                                                             load/load
3268                                                             atomic/atomicrmw.
3269                                                           - Ensures that
3270                                                             following loads
3271                                                             will not see stale
3272                                                             global data.
3273
3274     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3275                               - system                     vmcnt(0)
3276
3277                                                           - If OpenCL, omit
3278                                                             lgkmcnt(0).
3279                                                           - Could be split into
3280                                                             separate s_waitcnt
3281                                                             vmcnt(0) and
3282                                                             s_waitcnt
3283                                                             lgkmcnt(0) to allow
3284                                                             them to be
3285                                                             independently moved
3286                                                             according to the
3287                                                             following rules.
3288                                                           - s_waitcnt vmcnt(0)
3289                                                             must happen after
3290                                                             any preceding
3291                                                             global/generic
3292                                                             load/store/load
3293                                                             atomic/store
3294                                                             atomic/atomicrmw.
3295                                                           - s_waitcnt lgkmcnt(0)
3296                                                             must happen after
3297                                                             any preceding
3298                                                             local/generic
3299                                                             load/store/load
3300                                                             atomic/store
3301                                                             atomic/atomicrmw.
3302                                                           - Must happen before
3303                                                             the following
3304                                                             atomicrmw.
3305                                                           - Ensures that all
3306                                                             memory operations
3307                                                             to global have
3308                                                             completed before
3309                                                             performing the
3310                                                             atomicrmw that is
3311                                                             being released.
3312
3313                                                         2. flat_atomic
3314                                                         3. s_waitcnt vmcnt(0) &
3315                                                            lgkmcnt(0)
3316
3317                                                           - If OpenCL, omit
3318                                                             lgkmcnt(0).
3319                                                           - Must happen before
3320                                                             following
3321                                                             buffer_wbinvl1_vol.
3322                                                           - Ensures the
3323                                                             atomicrmw has
3324                                                             completed before
3325                                                             invalidating the
3326                                                             cache.
3327
3328                                                         4. buffer_wbinvl1_vol
3329
3330                                                           - Must happen before
3331                                                             any following
3332                                                             global/generic
3333                                                             load/load
3334                                                             atomic/atomicrmw.
3335                                                           - Ensures that
3336                                                             following loads
3337                                                             will not see stale
3338                                                             global data.
3339
3340     fence        acq_rel      - singlethread *none*     *none*
3341                               - wavefront
3342     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3343
3344                                                           - If OpenCL and
3345                                                             address space is
3346                                                             not generic, omit.
3347                                                           - However,
3348                                                             since LLVM
3349                                                             currently has no
3350                                                             address space on
3351                                                             the fence need to
3352                                                             conservatively
3353                                                             always generate
3354                                                             (see comment for
3355                                                             previous fence).
3356                                                           - Must happen after
3357                                                             any preceding
3358                                                             local/generic
3359                                                             load/load
3360                                                             atomic/store/store
3361                                                             atomic/atomicrmw.
3362                                                           - Must happen before
3363                                                             any following
3364                                                             global/generic
3365                                                             load/load
3366                                                             atomic/store/store
3367                                                             atomic/atomicrmw.
3368                                                           - Ensures that all
3369                                                             memory operations
3370                                                             to local have
3371                                                             completed before
3372                                                             performing any
3373                                                             following global
3374                                                             memory operations.
3375                                                           - Ensures that the
3376                                                             preceding
3377                                                             local/generic load
3378                                                             atomic/atomicrmw
3379                                                             with an equal or
3380                                                             wider sync scope
3381                                                             and memory ordering
3382                                                             stronger than
3383                                                             unordered (this is
3384                                                             termed the
3385                                                             acquire-fence-paired-atomic
3386                                                             ) has completed
3387                                                             before following
3388                                                             global memory
3389                                                             operations. This
3390                                                             satisfies the
3391                                                             requirements of
3392                                                             acquire.
3393                                                           - Ensures that all
3394                                                             previous memory
3395                                                             operations have
3396                                                             completed before a
3397                                                             following
3398                                                             local/generic store
3399                                                             atomic/atomicrmw
3400                                                             with an equal or
3401                                                             wider sync scope
3402                                                             and memory ordering
3403                                                             stronger than
3404                                                             unordered (this is
3405                                                             termed the
3406                                                             release-fence-paired-atomic
3407                                                             ). This satisfies the
3408                                                             requirements of
3409                                                             release.
3410
3411     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3412                               - system                     vmcnt(0)
3413
3414                                                           - If OpenCL and
3415                                                             address space is
3416                                                             not generic, omit
3417                                                             lgkmcnt(0).
3418                                                           - However, since LLVM
3419                                                             currently has no
3420                                                             address space on
3421                                                             the fence need to
3422                                                             conservatively
3423                                                             always generate
3424                                                             (see comment for
3425                                                             previous fence).
3426                                                           - Could be split into
3427                                                             separate s_waitcnt
3428                                                             vmcnt(0) and
3429                                                             s_waitcnt
3430                                                             lgkmcnt(0) to allow
3431                                                             them to be
3432                                                             independently moved
3433                                                             according to the
3434                                                             following rules.
3435                                                           - s_waitcnt vmcnt(0)
3436                                                             must happen after
3437                                                             any preceding
3438                                                             global/generic
3439                                                             load/store/load
3440                                                             atomic/store
3441                                                             atomic/atomicrmw.
3442                                                           - s_waitcnt lgkmcnt(0)
3443                                                             must happen after
3444                                                             any preceding
3445                                                             local/generic
3446                                                             load/store/load
3447                                                             atomic/store
3448                                                             atomic/atomicrmw.
3449                                                           - Must happen before
3450                                                             the following
3451                                                             buffer_wbinvl1_vol.
3452                                                           - Ensures that the
3453                                                             preceding
3454                                                             global/local/generic
3455                                                             load
3456                                                             atomic/atomicrmw
3457                                                             with an equal or
3458                                                             wider sync scope
3459                                                             and memory ordering
3460                                                             stronger than
3461                                                             unordered (this is
3462                                                             termed the
3463                                                             acquire-fence-paired-atomic
3464                                                             ) has completed
3465                                                             before invalidating
3466                                                             the cache. This
3467                                                             satisfies the
3468                                                             requirements of
3469                                                             acquire.
3470                                                           - Ensures that all
3471                                                             previous memory
3472                                                             operations have
3473                                                             completed before a
3474                                                             following
3475                                                             global/local/generic
3476                                                             store
3477                                                             atomic/atomicrmw
3478                                                             with an equal or
3479                                                             wider sync scope
3480                                                             and memory ordering
3481                                                             stronger than
3482                                                             unordered (this is
3483                                                             termed the
3484                                                             release-fence-paired-atomic
3485                                                             ). This satisfies the
3486                                                             requirements of
3487                                                             release.
3488
3489                                                         2. buffer_wbinvl1_vol
3490
3491                                                           - Must happen before
3492                                                             any following
3493                                                             global/generic
3494                                                             load/load
3495                                                             atomic/store/store
3496                                                             atomic/atomicrmw.
3497                                                           - Ensures that
3498                                                             following loads
3499                                                             will not see stale
3500                                                             global data. This
3501                                                             satisfies the
3502                                                             requirements of
3503                                                             acquire.
3504
3505     **Sequential Consistent Atomic**
3506     -----------------------------------------------------------------------------------
3507     load atomic  seq_cst      - singlethread - global   *Same as corresponding
3508                               - wavefront    - local    load atomic acquire,
3509                                              - generic  except must generated
3510                                                         all instructions even
3511                                                         for OpenCL.*
3512     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3513                                              - generic
3514                                                           - Must
3515                                                             happen after
3516                                                             preceding
3517                                                             global/generic load
3518                                                             atomic/store
3519                                                             atomic/atomicrmw
3520                                                             with memory
3521                                                             ordering of seq_cst
3522                                                             and with equal or
3523                                                             wider sync scope.
3524                                                             (Note that seq_cst
3525                                                             fences have their
3526                                                             own s_waitcnt
3527                                                             lgkmcnt(0) and so do
3528                                                             not need to be
3529                                                             considered.)
3530                                                           - Ensures any
3531                                                             preceding
3532                                                             sequential
3533                                                             consistent local
3534                                                             memory instructions
3535                                                             have completed
3536                                                             before executing
3537                                                             this sequentially
3538                                                             consistent
3539                                                             instruction. This
3540                                                             prevents reordering
3541                                                             a seq_cst store
3542                                                             followed by a
3543                                                             seq_cst load. (Note
3544                                                             that seq_cst is
3545                                                             stronger than
3546                                                             acquire/release as
3547                                                             the reordering of
3548                                                             load acquire
3549                                                             followed by a store
3550                                                             release is
3551                                                             prevented by the
3552                                                             waitcnt of
3553                                                             the release, but
3554                                                             there is nothing
3555                                                             preventing a store
3556                                                             release followed by
3557                                                             load acquire from
3558                                                             competing out of
3559                                                             order.)
3560
3561                                                         2. *Following
3562                                                            instructions same as
3563                                                            corresponding load
3564                                                            atomic acquire,
3565                                                            except must generated
3566                                                            all instructions even
3567                                                            for OpenCL.*
3568     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3569                                                         load atomic acquire,
3570                                                         except must generated
3571                                                         all instructions even
3572                                                         for OpenCL.*
3573     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3574                               - system       - generic     vmcnt(0)
3575
3576                                                           - Could be split into
3577                                                             separate s_waitcnt
3578                                                             vmcnt(0)
3579                                                             and s_waitcnt
3580                                                             lgkmcnt(0) to allow
3581                                                             them to be
3582                                                             independently moved
3583                                                             according to the
3584                                                             following rules.
3585                                                           - waitcnt lgkmcnt(0)
3586                                                             must happen after
3587                                                             preceding
3588                                                             global/generic load
3589                                                             atomic/store
3590                                                             atomic/atomicrmw
3591                                                             with memory
3592                                                             ordering of seq_cst
3593                                                             and with equal or
3594                                                             wider sync scope.
3595                                                             (Note that seq_cst
3596                                                             fences have their
3597                                                             own s_waitcnt
3598                                                             lgkmcnt(0) and so do
3599                                                             not need to be
3600                                                             considered.)
3601                                                           - waitcnt vmcnt(0)
3602                                                             must happen after
3603                                                             preceding
3604                                                             global/generic load
3605                                                             atomic/store
3606                                                             atomic/atomicrmw
3607                                                             with memory
3608                                                             ordering of seq_cst
3609                                                             and with equal or
3610                                                             wider sync scope.
3611                                                             (Note that seq_cst
3612                                                             fences have their
3613                                                             own s_waitcnt
3614                                                             vmcnt(0) and so do
3615                                                             not need to be
3616                                                             considered.)
3617                                                           - Ensures any
3618                                                             preceding
3619                                                             sequential
3620                                                             consistent global
3621                                                             memory instructions
3622                                                             have completed
3623                                                             before executing
3624                                                             this sequentially
3625                                                             consistent
3626                                                             instruction. This
3627                                                             prevents reordering
3628                                                             a seq_cst store
3629                                                             followed by a
3630                                                             seq_cst load. (Note
3631                                                             that seq_cst is
3632                                                             stronger than
3633                                                             acquire/release as
3634                                                             the reordering of
3635                                                             load acquire
3636                                                             followed by a store
3637                                                             release is
3638                                                             prevented by the
3639                                                             waitcnt of
3640                                                             the release, but
3641                                                             there is nothing
3642                                                             preventing a store
3643                                                             release followed by
3644                                                             load acquire from
3645                                                             competing out of
3646                                                             order.)
3647
3648                                                         2. *Following
3649                                                            instructions same as
3650                                                            corresponding load
3651                                                            atomic acquire,
3652                                                            except must generated
3653                                                            all instructions even
3654                                                            for OpenCL.*
3655     store atomic seq_cst      - singlethread - global   *Same as corresponding
3656                               - wavefront    - local    store atomic release,
3657                               - workgroup    - generic  except must generated
3658                                                         all instructions even
3659                                                         for OpenCL.*
3660     store atomic seq_cst      - agent        - global   *Same as corresponding
3661                               - system       - generic  store atomic release,
3662                                                         except must generated
3663                                                         all instructions even
3664                                                         for OpenCL.*
3665     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3666                               - wavefront    - local    atomicrmw acq_rel,
3667                               - workgroup    - generic  except must generated
3668                                                         all instructions even
3669                                                         for OpenCL.*
3670     atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3671                               - system       - generic  atomicrmw acq_rel,
3672                                                         except must generated
3673                                                         all instructions even
3674                                                         for OpenCL.*
3675     fence        seq_cst      - singlethread *none*     *Same as corresponding
3676                               - wavefront               fence acq_rel,
3677                               - workgroup               except must generated
3678                               - agent                   all instructions even
3679                               - system                  for OpenCL.*
3680     ============ ============ ============== ========== ===============================
3681
3682The memory order also adds the single thread optimization constrains defined in
3683table
3684:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3685
3686  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3687     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3688
3689     ============ ==============================================================
3690     LLVM Memory  Optimization Constraints
3691     Ordering
3692     ============ ==============================================================
3693     unordered    *none*
3694     monotonic    *none*
3695     acquire      - If a load atomic/atomicrmw then no following load/load
3696                    atomic/store/ store atomic/atomicrmw/fence instruction can
3697                    be moved before the acquire.
3698                  - If a fence then same as load atomic, plus no preceding
3699                    associated fence-paired-atomic can be moved after the fence.
3700     release      - If a store atomic/atomicrmw then no preceding load/load
3701                    atomic/store/ store atomic/atomicrmw/fence instruction can
3702                    be moved after the release.
3703                  - If a fence then same as store atomic, plus no following
3704                    associated fence-paired-atomic can be moved before the
3705                    fence.
3706     acq_rel      Same constraints as both acquire and release.
3707     seq_cst      - If a load atomic then same constraints as acquire, plus no
3708                    preceding sequentially consistent load atomic/store
3709                    atomic/atomicrmw/fence instruction can be moved after the
3710                    seq_cst.
3711                  - If a store atomic then the same constraints as release, plus
3712                    no following sequentially consistent load atomic/store
3713                    atomic/atomicrmw/fence instruction can be moved before the
3714                    seq_cst.
3715                  - If an atomicrmw/fence then same constraints as acq_rel.
3716     ============ ==============================================================
3717
3718Trap Handler ABI
3719~~~~~~~~~~~~~~~~
3720
3721For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3722(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3723the ``s_trap`` instruction with the following usage:
3724
3725  .. table:: AMDGPU Trap Handler for AMDHSA OS
3726     :name: amdgpu-trap-handler-for-amdhsa-os-table
3727
3728     =================== =============== =============== =======================
3729     Usage               Code Sequence   Trap Handler    Description
3730                                         Inputs
3731     =================== =============== =============== =======================
3732     reserved            ``s_trap 0x00``                 Reserved by hardware.
3733     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3734                                           ``queue_ptr`` ``debugtrap``
3735                                         ``VGPR0``:      intrinsic (not
3736                                           ``arg``       implemented).
3737     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3738                                           ``queue_ptr`` terminated and its
3739                                                         associated queue put
3740                                                         into the error state.
3741     ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3742                                           ``queue_ptr`` installed handled
3743                                                         same as ``llvm.trap``.
3744     debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3745                                                         breakpoints.
3746     debugger            ``s_trap 0x08``                 Reserved for debugger.
3747     debugger            ``s_trap 0xfe``                 Reserved for debugger.
3748     debugger            ``s_trap 0xff``                 Reserved for debugger.
3749     =================== =============== =============== =======================
3750
3751Unspecified OS
3752--------------
3753
3754This section provides code conventions used when the target triple OS is
3755empty (see :ref:`amdgpu-target-triples`).
3756
3757Trap Handler ABI
3758~~~~~~~~~~~~~~~~
3759
3760For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3761not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3762instructions are handled as follows:
3763
3764  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3765     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3766
3767     =============== =============== ===========================================
3768     Usage           Code Sequence   Description
3769     =============== =============== ===========================================
3770     llvm.trap       s_endpgm        Causes wavefront to be terminated.
3771     llvm.debugtrap  *none*          Compiler warning given that there is no
3772                                     trap handler installed.
3773     =============== =============== ===========================================
3774
3775Source Languages
3776================
3777
3778.. _amdgpu-opencl:
3779
3780OpenCL
3781------
3782
3783When generating code for the OpenCL language the target triple environment
3784should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3785
3786When the language is OpenCL the following differences occur:
3787
37881. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37892. The AMDGPU backend adds additional arguments to the kernel.
37903. Additional metadata is generated
3791   (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3792
3793.. TODO
3794   Specify what affect this has. Hidden arguments added. Additional metadata
3795   generated.
3796
3797.. _amdgpu-hcc:
3798
3799HCC
3800---
3801
3802When generating code for the OpenCL language the target triple environment
3803should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3804
3805When the language is OpenCL the following differences occur:
3806
38071. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3808
3809.. TODO
3810   Specify what affect this has.
3811
3812Assembler
3813---------
3814
3815AMDGPU backend has LLVM-MC based assembler which is currently in development.
3816It supports AMDGCN GFX6-GFX9.
3817
3818This section describes general syntax for instructions and operands. For more
3819information about instructions, their semantics and supported combinations of
3820operands, refer to one of instruction set architecture manuals
3821[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3822
3823An instruction has the following syntax (register operands are normally
3824comma-separated while extra operands are space-separated):
3825
3826*<opcode> <register_operand0>, ... <extra_operand0> ...*
3827
3828Operands
3829~~~~~~~~
3830
3831The following syntax for register operands is supported:
3832
3833* SGPR registers: s0, ... or s[0], ...
3834* VGPR registers: v0, ... or v[0], ...
3835* TTMP registers: ttmp0, ... or ttmp[0], ...
3836* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3837* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3838* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3839* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3840* Register index expressions: v[2*2], s[1-1:2-1]
3841* 'off' indicates that an operand is not enabled
3842
3843The following extra operands are supported:
3844
3845* offset, offset0, offset1
3846* idxen, offen bits
3847* glc, slc, tfe bits
3848* waitcnt: integer or combination of counter values
3849* VOP3 modifiers:
3850
3851  - abs (\| \|), neg (\-)
3852
3853* DPP modifiers:
3854
3855  - row_shl, row_shr, row_ror, row_rol
3856  - row_mirror, row_half_mirror, row_bcast
3857  - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3858  - row_mask, bank_mask, bound_ctrl
3859
3860* SDWA modifiers:
3861
3862  - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3863  - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3864  - abs, neg, sext
3865
3866Instruction Examples
3867~~~~~~~~~~~~~~~~~~~~
3868
3869DS
3870~~
3871
3872.. code-block:: nasm
3873
3874  ds_add_u32 v2, v4 offset:16
3875  ds_write_src2_b64 v2 offset0:4 offset1:8
3876  ds_cmpst_f32 v2, v4, v6
3877  ds_min_rtn_f64 v[8:9], v2, v[4:5]
3878
3879
3880For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3881
3882FLAT
3883++++
3884
3885.. code-block:: nasm
3886
3887  flat_load_dword v1, v[3:4]
3888  flat_store_dwordx3 v[3:4], v[5:7]
3889  flat_atomic_swap v1, v[3:4], v5 glc
3890  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3891  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3892
3893For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3894
3895MUBUF
3896+++++
3897
3898.. code-block:: nasm
3899
3900  buffer_load_dword v1, off, s[4:7], s1
3901  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3902  buffer_store_format_xy v[1:2], off, s[4:7], s1
3903  buffer_wbinvl1
3904  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3905
3906For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3907
3908SMRD/SMEM
3909+++++++++
3910
3911.. code-block:: nasm
3912
3913  s_load_dword s1, s[2:3], 0xfc
3914  s_load_dwordx8 s[8:15], s[2:3], s4
3915  s_load_dwordx16 s[88:103], s[2:3], s4
3916  s_dcache_inv_vol
3917  s_memtime s[4:5]
3918
3919For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3920
3921SOP1
3922++++
3923
3924.. code-block:: nasm
3925
3926  s_mov_b32 s1, s2
3927  s_mov_b64 s[0:1], 0x80000000
3928  s_cmov_b32 s1, 200
3929  s_wqm_b64 s[2:3], s[4:5]
3930  s_bcnt0_i32_b64 s1, s[2:3]
3931  s_swappc_b64 s[2:3], s[4:5]
3932  s_cbranch_join s[4:5]
3933
3934For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3935
3936SOP2
3937++++
3938
3939.. code-block:: nasm
3940
3941  s_add_u32 s1, s2, s3
3942  s_and_b64 s[2:3], s[4:5], s[6:7]
3943  s_cselect_b32 s1, s2, s3
3944  s_andn2_b32 s2, s4, s6
3945  s_lshr_b64 s[2:3], s[4:5], s6
3946  s_ashr_i32 s2, s4, s6
3947  s_bfm_b64 s[2:3], s4, s6
3948  s_bfe_i64 s[2:3], s[4:5], s6
3949  s_cbranch_g_fork s[4:5], s[6:7]
3950
3951For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3952
3953SOPC
3954++++
3955
3956.. code-block:: nasm
3957
3958  s_cmp_eq_i32 s1, s2
3959  s_bitcmp1_b32 s1, s2
3960  s_bitcmp0_b64 s[2:3], s4
3961  s_setvskip s3, s5
3962
3963For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3964
3965SOPP
3966++++
3967
3968.. code-block:: nasm
3969
3970  s_barrier
3971  s_nop 2
3972  s_endpgm
3973  s_waitcnt 0 ; Wait for all counters to be 0
3974  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3975  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3976  s_sethalt 9
3977  s_sleep 10
3978  s_sendmsg 0x1
3979  s_sendmsg sendmsg(MSG_INTERRUPT)
3980  s_trap 1
3981
3982For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3983
3984Unless otherwise mentioned, little verification is performed on the operands
3985of SOPP Instructions, so it is up to the programmer to be familiar with the
3986range or acceptable values.
3987
3988VALU
3989++++
3990
3991For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3992the assembler will automatically use optimal encoding based on its operands.
3993To force specific encoding, one can add a suffix to the opcode of the instruction:
3994
3995* _e32 for 32-bit VOP1/VOP2/VOPC
3996* _e64 for 64-bit VOP3
3997* _dpp for VOP_DPP
3998* _sdwa for VOP_SDWA
3999
4000VOP1/VOP2/VOP3/VOPC examples:
4001
4002.. code-block:: nasm
4003
4004  v_mov_b32 v1, v2
4005  v_mov_b32_e32 v1, v2
4006  v_nop
4007  v_cvt_f64_i32_e32 v[1:2], v2
4008  v_floor_f32_e32 v1, v2
4009  v_bfrev_b32_e32 v1, v2
4010  v_add_f32_e32 v1, v2, v3
4011  v_mul_i32_i24_e64 v1, v2, 3
4012  v_mul_i32_i24_e32 v1, -3, v3
4013  v_mul_i32_i24_e32 v1, -100, v3
4014  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4015  v_max_f16_e32 v1, v2, v3
4016
4017VOP_DPP examples:
4018
4019.. code-block:: nasm
4020
4021  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4022  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4023  v_mov_b32 v0, v0 wave_shl:1
4024  v_mov_b32 v0, v0 row_mirror
4025  v_mov_b32 v0, v0 row_bcast:31
4026  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4027  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4028  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4029
4030VOP_SDWA examples:
4031
4032.. code-block:: nasm
4033
4034  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4035  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4036  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4037  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4038  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4039
4040For full list of supported instructions, refer to "Vector ALU instructions".
4041
4042HSA Code Object Directives
4043~~~~~~~~~~~~~~~~~~~~~~~~~~
4044
4045AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4046one can specify them with assembler directives.
4047
4048.hsa_code_object_version major, minor
4049+++++++++++++++++++++++++++++++++++++
4050
4051*major* and *minor* are integers that specify the version of the HSA code
4052object that will be generated by the assembler.
4053
4054.hsa_code_object_isa [major, minor, stepping, vendor, arch]
4055+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4056
4057
4058*major*, *minor*, and *stepping* are all integers that describe the instruction
4059set architecture (ISA) version of the assembly program.
4060
4061*vendor* and *arch* are quoted strings.  *vendor* should always be equal to
4062"AMD" and *arch* should always be equal to "AMDGPU".
4063
4064By default, the assembler will derive the ISA version, *vendor*, and *arch*
4065from the value of the -mcpu option that is passed to the assembler.
4066
4067.amdgpu_hsa_kernel (name)
4068+++++++++++++++++++++++++
4069
4070This directives specifies that the symbol with given name is a kernel entry point
4071(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4072
4073.amd_kernel_code_t
4074++++++++++++++++++
4075
4076This directive marks the beginning of a list of key / value pairs that are used
4077to specify the amd_kernel_code_t object that will be emitted by the assembler.
4078The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
4079any amd_kernel_code_t values that are unspecified a default value will be
4080used.  The default value for all keys is 0, with the following exceptions:
4081
4082- *kernel_code_version_major* defaults to 1.
4083- *machine_kind* defaults to 1.
4084- *machine_version_major*, *machine_version_minor*, and
4085  *machine_version_stepping* are derived from the value of the -mcpu option
4086  that is passed to the assembler.
4087- *kernel_code_entry_byte_offset* defaults to 256.
4088- *wavefront_size* defaults to 6.
4089- *kernarg_segment_alignment*, *group_segment_alignment*, and
4090  *private_segment_alignment* default to 4. Note that alignments are specified
4091  as a power of two, so a value of **n** means an alignment of 2^ **n**.
4092
4093The *.amd_kernel_code_t* directive must be placed immediately after the
4094function label and before any instructions.
4095
4096For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4097comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4098
4099Here is an example of a minimal amd_kernel_code_t specification:
4100
4101.. code-block:: none
4102
4103   .hsa_code_object_version 1,0
4104   .hsa_code_object_isa
4105
4106   .hsatext
4107   .globl  hello_world
4108   .p2align 8
4109   .amdgpu_hsa_kernel hello_world
4110
4111   hello_world:
4112
4113      .amd_kernel_code_t
4114         enable_sgpr_kernarg_segment_ptr = 1
4115         is_ptr64 = 1
4116         compute_pgm_rsrc1_vgprs = 0
4117         compute_pgm_rsrc1_sgprs = 0
4118         compute_pgm_rsrc2_user_sgpr = 2
4119         kernarg_segment_byte_size = 8
4120         wavefront_sgpr_count = 2
4121         workitem_vgpr_count = 3
4122     .end_amd_kernel_code_t
4123
4124     s_load_dwordx2 s[0:1], s[0:1] 0x0
4125     v_mov_b32 v0, 3.14159
4126     s_waitcnt lgkmcnt(0)
4127     v_mov_b32 v1, s0
4128     v_mov_b32 v2, s1
4129     flat_store_dword v[1:2], v0
4130     s_endpgm
4131   .Lfunc_end0:
4132        .size   hello_world, .Lfunc_end0-hello_world
4133
4134Additional Documentation
4135========================
4136
4137.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4138.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4139.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4140.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4141.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4142.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4143.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4144.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4145.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4146.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4147.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4148.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4149.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4150.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4151.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4152.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4153