=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================
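
As an illustration of how the four tables combine, the following is a minimal
Python sketch that assembles a triple string from its components. The helper
name ``make_triple`` is hypothetical, the allowed values are copied from the
tables above, and omitting trailing empty components is an assumption of this
sketch rather than documented Clang behavior.

.. code-block:: python

  # Component values copied from the tables above.
  ARCHITECTURES = {"r600", "amdgcn"}
  VENDORS = {"amd", "mesa3d"}
  OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}  # "" means *unknown*
  ENVIRONMENTS = {""}


  def make_triple(arch, vendor="amd", os="", environment=""):
      """Return <Architecture>-<Vendor>-<OS>-<Environment> (hypothetical helper)."""
      if arch not in ARCHITECTURES:
          raise ValueError(f"unknown architecture: {arch!r}")
      if vendor not in VENDORS:
          raise ValueError(f"unknown vendor: {vendor!r}")
      if os not in OPERATING_SYSTEMS:
          raise ValueError(f"unknown OS: {os!r}")
      if environment not in ENVIRONMENTS:
          raise ValueError(f"unknown environment: {environment!r}")
      # The vendor table allows ``mesa3d`` only when the OS is also ``mesa3d``.
      if vendor == "mesa3d" and os != "mesa3d":
          raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
      parts = [arch, vendor, os, environment]
      # Assumption of this sketch: trailing empty components are omitted.
      while parts and parts[-1] == "":
          parts.pop()
      return "-".join(parts)


  triple = make_triple("amdgcn", "amd", "amdhsa")
  print(triple)               # amdgcn-amd-amdhsa
  print(["-target", triple])  # argument pair for the Clang option above

The validation step only mirrors the constraints stated in the tables; it does
not check whether a given runtime actually accepts the resulting triple.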

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target-specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exception:

* ``amdhsa`` is not supported on the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).
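
The rule above can be sketched in Python as follows. The function names
``os_supported`` and ``clang_args`` are hypothetical, the processor entries are
a small excerpt of the AMDGPU Processors table, the ``amd`` vendor is chosen
only for illustration, and the per-processor *OS Support* column of that table
is not modeled.

.. code-block:: python

  # A few processors and their target triple architecture, copied from the
  # AMDGPU Processors table; this excerpt is illustrative, not complete.
  PROCESSOR_ARCH = {
      "cayman": "r600",
      "gfx900": "amdgcn",
      "gfx1030": "amdgcn",
  }


  def os_supported(arch, os):
      """Apply the stated exception: ``amdhsa`` is not supported on ``r600``."""
      return not (arch == "r600" and os == "amdhsa")


  def clang_args(processor, os):
      """Build ``-target`` and ``-mcpu`` arguments (hypothetical helper)."""
      arch = PROCESSOR_ARCH[processor]
      if not os_supported(arch, os):
          raise ValueError(f"{os} is not supported on the {arch} architecture")
      return ["-target", f"{arch}-amd-{os}", f"-mcpu={processor}"]


  print(clang_args("gfx900", "amdhsa"))
  # ['-target', 'amdgcn-amd-amdhsa', '-mcpu=gfx900']

Requesting ``amdhsa`` for an ``r600`` processor such as ``cayman`` raises
``ValueError`` in this sketch, which mirrors the exception listed above.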


  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.
418     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
419                                                    - wavefrontsize64   flat
420                                                                        scratch                       .. TODO::
421
422                                                                                                        Add product
423                                                                                                        names.
424
425     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
426                                                    - wavefrontsize64   flat
427                                                                        scratch                       .. TODO::
428                                                                                                        Add product
429                                                                                                        names.
430
431     =========== =============== ============ ===== ================= =============== =============== ======================
432
433.. _amdgpu-target-features:
434
435Target Features
436---------------
437
Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
440all processors. The runtime must ensure that the features supported by
441the device used to execute the code match the features enabled when
442generating the code. A mismatch of features may result in incorrect
443execution, or a reduction in performance.
444
The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
447
448Target features are controlled by exactly one of the following Clang
449options:
450
451``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
452
  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.
456
457``-m[no-]<target-feature>``
458
459  Target features not specified by the target ID are specified using a
460  separate option. These target features can have an ``on`` or ``off``
461  value.  ``on`` is specified by omitting the ``no-`` prefix, and
462  ``off`` is specified by including the ``no-`` prefix. The default
463  if not specified is ``off``.
464
465For example:
466
467``-mcpu=gfx908:xnack+``
468  Enable the ``xnack`` feature.
469``-mcpu=gfx908:xnack-``
470  Disable the ``xnack`` feature.
471``-mcumode``
472  Enable the ``cumode`` feature.
473``-mno-cumode``
474  Disable the ``cumode`` feature.
475
476  .. table:: AMDGPU Target Features
477     :name: amdgpu-target-features-table
478
479     =============== ============================ ==================================================
480     Target Feature  Clang Option to Control      Description
481     Name
482     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled,
                                                  native WGP wavefront execution mode is used;
                                                  when enabled, CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).
488
489     sramecc         - ``-mcpu``                  If specified, generate code that can only be
490                     - ``--offload-arch``         loaded and executed in a process that has a
491                                                  matching setting for SRAMECC.
492
493                                                  If not specified for code object V2 to V3, generate
494                                                  code that can be loaded and executed in a process
495                                                  with SRAMECC enabled.
496
497                                                  If not specified for code object V4, generate
498                                                  code that can be loaded and executed in a process
499                                                  with either setting of SRAMECC.
500
     tgsplit         - ``-m[no-]tgsplit``         Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled, the waves of a work-group may be
                                                  launched in different CUs.
505
     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled,
                                                  native wavefront size 32 is used; when enabled,
                                                  wavefront size 64 is used.
510
511     xnack           - ``-mcpu``                  If specified, generate code that can only be
512                     - ``--offload-arch``         loaded and executed in a process that has a
513                                                  matching setting for XNACK replay.
514
515                                                  If not specified for code object V2 to V3, generate
516                                                  code that can be loaded and executed in a process
517                                                  with XNACK replay enabled.
518
519                                                  If not specified for code object V4, generate
520                                                  code that can be loaded and executed in a process
521                                                  with either setting of XNACK replay.
522
523                                                  XNACK replay can be used for demand paging and
524                                                  page migration. If enabled in the device, then if
525                                                  a page fault occurs the code may execute
526                                                  incorrectly unless generated with XNACK replay
527                                                  enabled, or generated for code object V4 without
528                                                  specifying XNACK replay. Executing code that was
529                                                  generated with XNACK replay enabled, or generated
530                                                  for code object V4 without specifying XNACK replay,
531                                                  on a device that does not have XNACK replay
532                                                  enabled will execute correctly but may be less
533                                                  performant than code generated for XNACK replay
534                                                  disabled.
535     =============== ============================ ==================================================
536
537.. _amdgpu-target-id:
538
539Target ID
540---------
541
542AMDGPU supports target IDs. See `Clang Offload Bundler
543<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
544description. The AMDGPU target specific information is:
545
546**processor**
547  Is an AMDGPU processor or alternative processor name specified in
548  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
549  the primary processor and alternative processor names. The canonical form
  target ID allows only the primary processor name.
551
552**target-feature**
553  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
556  a target ID are marked as being controlled by ``-mcpu`` and
557  ``--offload-arch``. Each target feature must appear at most once in a target
558  ID. The non-canonical form target ID allows the target features to be
559  specified in any order. The canonical form target ID requires the target
560  features to be specified in alphabetic order.
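
The ordering rule for the canonical form can be illustrated with a minimal
Python sketch. It assumes the ``<processor>(:<target-feature>(+|-))*`` spelling
used by examples such as ``-mcpu=gfx908:xnack+``. The ``canonicalize_target_id``
helper is hypothetical; it does not map alternative processor names to primary
names, and it does not check that the processor supports each feature.

.. code:: python

  def canonicalize_target_id(target_id):
      """Return target_id with its target features in alphabetic order."""
      processor, *features = target_id.split(":")
      seen = {}
      for feature in features:
          # Each feature is a name followed by "+" (on) or "-" (off).
          name, setting = feature[:-1], feature[-1:]
          if not name or setting not in ("+", "-"):
              raise ValueError(f"malformed target feature: {feature!r}")
          # A target feature must appear at most once in a target ID.
          if name in seen:
              raise ValueError(f"target feature specified twice: {name!r}")
          seen[name] = setting
      return ":".join([processor] + [n + seen[n] for n in sorted(seen)])


  print(canonicalize_target_id("gfx908:xnack+:sramecc-"))  # gfx908:sramecc-:xnack+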
561
562.. _amdgpu-target-id-v2-v3:
563
564Code Object V2 to V3 Target ID
565~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
566
567The target ID syntax for code object V2 to V3 is the same as defined in `Clang
568Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
569when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
570directive and the bundle entry ID. In those cases it has the following BNF
571syntax:
572
573.. code::
574
575  <target-id> ::== <processor> ( "+" <target-feature> )*
576
577Where a target feature is omitted if *Off* and present if *On* or *Any*.
578
579.. note::
580
  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.
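
As a minimal sketch of this rule, the following Python helper builds the code
object V2 to V3 form from per-feature settings. The ``v2_v3_target_id`` name and
the ``"on"``/``"off"``/``"any"`` spellings are assumptions made for
illustration. The order in which features are emitted is not specified here, so
the sketch simply keeps the order it is given.

.. code:: python

  def v2_v3_target_id(processor, features):
      """Build <processor> ( "+" <target-feature> )* from feature settings.

      features maps a target feature name to "on", "off" or "any".
      """
      parts = [processor]
      for name, setting in features.items():
          if setting not in ("on", "off", "any"):
              raise ValueError(f"unknown setting for {name!r}: {setting!r}")
          # Off is omitted; On and Any are both spelled "+<target-feature>".
          if setting != "off":
              parts.append(name)
      return "+".join(parts)


  print(v2_v3_target_id("gfx908", {"xnack": "any", "sramecc": "off"}))  # gfx908+xnack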
583
584.. _amdgpu-embedding-bundled-objects:
585
586Embedding Bundled Code Objects
587------------------------------
588
589AMDGPU supports the HIP and OpenMP languages that perform code object embedding
590as described in `Clang Offload Bundler
591<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
592
593.. note::
594
595  The target ID syntax used for code object V2 to V3 for a bundle entry ID
596  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
597
598.. _amdgpu-address-spaces:
599
600Address Spaces
601--------------
602
603The AMDGPU architecture supports a number of memory address spaces. The address
604space names use the OpenCL standard names, with some additions.
605
606The AMDGPU address spaces correspond to target architecture specific LLVM
607address space numbers used in LLVM IR.
608
609The AMDGPU address spaces are described in
610:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
611supported for the ``amdgcn`` target.
612
613  .. table:: AMDGPU Address Spaces
614     :name: amdgpu-address-spaces-table
615
616     ================================= =============== =========== ================ ======= ============================
617     ..                                                                                     64-Bit Process Address Space
618     --------------------------------- --------------- ----------- ---------------- ------------------------------------
619     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
620                                       Space Number    Name        Name             Size
621     ================================= =============== =========== ================ ======= ============================
622     Generic                           0               flat        flat             64      0x0000000000000000
623     Global                            1               global      global           64      0x0000000000000000
624     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
625     Local                             3               group       LDS              32      0xFFFFFFFF
626     Constant                          4               constant    *same as global* 64      0x0000000000000000
627     Private                           5               private     scratch          32      0xFFFFFFFF
628     Constant 32-bit                   6               *TODO*                               0x00000000
629     Buffer Fat Pointer (experimental) 7               *TODO*
630     ================================= =============== =========== ================ ======= ============================
631
632**Generic**
633  The generic address space is supported unless the *Target Properties* column
634  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
635  space*.
636
637  The generic address space uses the hardware flat address support for two fixed
638  ranges of virtual addresses (the private and local apertures), that are
639  outside the range of addressable global memory, to map from a flat address to
640  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.
643
644  Flat access to scratch requires hardware aperture setup and setup in the
645  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
646  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
647  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
648
  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
654  GFX9-GFX10 the aperture base addresses are directly available as inline
655  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
656  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
657  aligned to 2^32 which makes it easier to convert from flat to segment or
658  segment to flat.
659
660  A global address space address has the same value when used as a flat address
661  so no conversion is needed.
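
  The aperture arithmetic for 64-bit address mode can be sketched in Python.
  This is an illustration only: the aperture base values are made up, the
  helper names are hypothetical, and NULL pointers (which have different values
  in the flat and segment address spaces) are deliberately not handled.

  .. code:: python

    APERTURE_SIZE = 1 << 32  # 2^32 bytes in 64-bit address mode

    # Hypothetical aperture bases, each aligned to 2^32 (not real hardware values).
    SHARED_BASE = 0x0001_0000_0000_0000
    PRIVATE_BASE = 0x0002_0000_0000_0000


    def segment_to_flat(aperture_base, segment_address):
        """Map a 32-bit private or local address into its flat aperture."""
        assert aperture_base % APERTURE_SIZE == 0
        assert 0 <= segment_address < APERTURE_SIZE
        # The base has zero low 32 bits, so OR behaves like addition here.
        return aperture_base | segment_address


    def in_aperture(aperture_base, flat_address):
        """True if the flat address lies inside the given aperture."""
        return aperture_base <= flat_address < aperture_base + APERTURE_SIZE


    def flat_to_segment(aperture_base, flat_address):
        """Recover the 32-bit segment address from a flat address in the aperture."""
        assert in_aperture(aperture_base, flat_address)
        return flat_address & (APERTURE_SIZE - 1)


    flat = segment_to_flat(SHARED_BASE, 0x1234)
    print(hex(flat))  # 0x1000000001234

  Because the base is aligned to the aperture size, both directions reduce to
  masking or combining the low 32 bits, which is the simplification the
  alignment is meant to provide.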
662
663**Global and Constant**
664  The global and constant address spaces both use global virtual addresses,
665  which are the same virtual address space used by the CPU. However, some
666  virtual addresses may only be accessible to the CPU, some only accessible
667  by the GPU, and some by both.
668
669  Using the constant address space indicates that the data will not change
670  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
675  of volatile data before each kernel dispatch execution to allow constant
676  memory to change values between kernel dispatches.
677
678**Region**
679  The region address space uses the hardware Global Data Store (GDS). All
680  wavefronts executing on the same device will access the same memory for any
681  given region address. However, the same region address accessed by wavefronts
682  executing on different devices will access different memory. It is higher
683  performance than global memory. It is allocated by the runtime. The data
684  store (DS) instructions can be used to access it.
685
686**Local**
687  The local address space uses the hardware Local Data Store (LDS) which is
688  automatically allocated when the hardware creates the wavefronts of a
689  work-group, and freed when all the wavefronts of a work-group have
690  terminated. All wavefronts belonging to the same work-group will access the
691  same memory for any given local address. However, the same local address
692  accessed by wavefronts belonging to different work-groups will access
693  different memory. It is higher performance than global memory. The data store
694  (DS) instructions can be used to access it.
695
696**Private**
697  The private address space uses the hardware scratch memory support which
698  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
700  given private address will be different to the memory accessed by another lane
701  of the same or different wavefront for the same private address.
702
703  If a kernel dispatch uses scratch, then the hardware allocates memory from a
704  pool of backing memory allocated by the runtime for each wavefront. The lanes
705  of the wavefront access this using dword (4 byte) interleaving. The mapping
706  used from private address to backing memory address is:
707
708    ``wavefront-scratch-base +
709    ((private-address / 4) * wavefront-size * 4) +
710    (wavefront-lane-id * 4) + (private-address % 4)``
711
712  If each lane of a wavefront accesses the same private address, the
713  interleaving results in adjacent dwords being accessed and hence requires
714  fewer cache lines to be fetched.
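
  The mapping can be checked with a short Python sketch. The function name, the
  default wavefront size of 64, and the scratch base value are assumptions
  chosen for illustration.

  .. code:: python

    def private_to_backing(wavefront_scratch_base, private_address, lane_id,
                           wavefront_size=64):
        """Apply the dword-interleaved mapping from private to backing address."""
        assert 0 <= lane_id < wavefront_size
        dword_index = private_address // 4   # which dword of the lane's private memory
        byte_in_dword = private_address % 4  # byte offset within that dword
        return (wavefront_scratch_base
                + dword_index * wavefront_size * 4
                + lane_id * 4
                + byte_in_dword)


    # All lanes reading private address 8 touch one contiguous run of dwords.
    base = 0x1000  # hypothetical wavefront scratch base
    addresses = [private_to_backing(base, 8, lane) for lane in range(64)]
    print(hex(addresses[0]), hex(addresses[1]), hex(addresses[63]))
    # 0x1200 0x1204 0x12fc

  In this example the 64 lanes cover the 256 contiguous bytes starting at
  ``0x1200``, whereas consecutive dwords of a single lane are 256 bytes apart.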
715
716  There are different ways that the wavefront scratch base address is
717  determined by a wavefront (see
718  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
719
720  Scratch memory can be accessed in an interleaved manner using buffer
721  instructions with the scratch buffer descriptor and per wavefront scratch
722  offset, by the scratch instructions, or by flat instructions. Multi-dword
723  access is not supported except by flat and scratch instructions in
724  GFX9-GFX10.
725
726**Constant 32-bit**
727  *TODO*
728
729**Buffer Fat Pointer**
730  The buffer fat pointer is an experimental address space that is currently
731  unsupported in the backend. It exposes a non-integral pointer that is in
732  the future intended to support the modelling of 128-bit buffer descriptors
733  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
734  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
735  model the buffer descriptors used heavily in graphics workloads targeting
736  the backend.
737
738.. _amdgpu-memory-scopes:
739
740Memory Scopes
741-------------
742
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
745:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
746
747The memory model supported is based on the HSA memory model [HSA]_ which is
748based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
749relation is transitive over the synchronizes-with relation independent of scope
750and synchronizes-with allows the memory scope instances to be inclusive (see
751table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
752
753This is different to the OpenCL [OpenCL]_ memory model which does not have scope
754inclusion and requires the memory scopes to exactly match. However, this
755is conservatively correct for OpenCL.
756
757  .. table:: AMDHSA LLVM Sync Scopes
758     :name: amdgpu-amdhsa-llvm-sync-scopes-table
759
760     ======================= ===================================================
761     LLVM Sync Scope         Description
762     ======================= ===================================================
763     *none*                  The default: ``system``.
764
765                             Synchronizes with, and participates in modification
766                             and seq_cst total orderings with, other operations
767                             (except image operations) for all address spaces
768                             (except private, or generic that accesses private)
769                             provided the other operation's sync scope is:
770
771                             - ``system``.
772                             - ``agent`` and executed by a thread on the same
773                               agent.
774                             - ``workgroup`` and executed by a thread in the
775                               same work-group.
776                             - ``wavefront`` and executed by a thread in the
777                               same wavefront.
778
779     ``agent``               Synchronizes with, and participates in modification
780                             and seq_cst total orderings with, other operations
781                             (except image operations) for all address spaces
782                             (except private, or generic that accesses private)
783                             provided the other operation's sync scope is:
784
785                             - ``system`` or ``agent`` and executed by a thread
786                               on the same agent.
787                             - ``workgroup`` and executed by a thread in the
788                               same work-group.
789                             - ``wavefront`` and executed by a thread in the
790                               same wavefront.
791
792     ``workgroup``           Synchronizes with, and participates in modification
793                             and seq_cst total orderings with, other operations
794                             (except image operations) for all address spaces
795                             (except private, or generic that accesses private)
796                             provided the other operation's sync scope is:
797
798                             - ``system``, ``agent`` or ``workgroup`` and
799                               executed by a thread in the same work-group.
800                             - ``wavefront`` and executed by a thread in the
801                               same wavefront.
802
803     ``wavefront``           Synchronizes with, and participates in modification
804                             and seq_cst total orderings with, other operations
805                             (except image operations) for all address spaces
806                             (except private, or generic that accesses private)
807                             provided the other operation's sync scope is:
808
809                             - ``system``, ``agent``, ``workgroup`` or
810                               ``wavefront`` and executed by a thread in the
811                               same wavefront.
812
     ``singlethread``        Only synchronizes with, and participates in
                             modification and seq_cst total orderings with,
815                             other operations (except image operations) running
816                             in the same thread for all address spaces (for
817                             example, in signal handlers).
818
819     ``one-as``              Same as ``system`` but only synchronizes with other
820                             operations within the same address space.
821
822     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
823                             operations within the same address space.
824
825     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
826                             other operations within the same address space.
827
828     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
829                             other operations within the same address space.
830
831     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
832                             other operations within the same address space.
833     ======================= ===================================================
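
One way to read the first four rows of the table is that two operations can
synchronize when both threads belong to the same instance of the narrower of
the two scopes. The Python sketch below encodes only that reading. It ignores
``singlethread``, the ``one-as`` variants, and the address space and image
operation exceptions, and the ``Thread`` and ``scopes_include`` names are
hypothetical.

.. code:: python

  from collections import namedtuple

  # Where a thread executes; equal fields mean the same agent, work-group,
  # or wavefront.
  Thread = namedtuple("Thread", ["agent", "workgroup", "wavefront"])

  # Scopes ordered from narrowest to widest.
  SCOPES = ["wavefront", "workgroup", "agent", "system"]


  def scopes_include(scope_a, thread_a, scope_b, thread_b):
      """True if both threads are inside one instance of the narrower scope."""
      narrower = min(scope_a, scope_b, key=SCOPES.index)
      if narrower == "system":
          return True
      if narrower == "agent":
          return thread_a.agent == thread_b.agent
      if narrower == "workgroup":
          return thread_a[:2] == thread_b[:2]
      return thread_a == thread_b  # wavefront


  t0 = Thread(agent=0, workgroup=0, wavefront=0)
  t1 = Thread(agent=0, workgroup=1, wavefront=0)
  print(scopes_include("system", t0, "agent", t1))     # True
  print(scopes_include("agent", t0, "workgroup", t1))  # False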
834
835LLVM IR Intrinsics
836------------------
837
838The AMDGPU backend implements the following LLVM IR intrinsics.
839
840*This section is WIP.*
841
842.. TODO::
843
844   List AMDGPU intrinsics.
845
846LLVM IR Attributes
847------------------
848
849The AMDGPU backend supports the following LLVM IR attributes.
850
851  .. table:: AMDGPU LLVM IR Attributes
852     :name: amdgpu-llvm-ir-attributes-table
853
854     ======================================= ==========================================================
855     LLVM Attribute                          Description
856     ======================================= ==========================================================
857     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
858                                             will be specified when the kernel is dispatched. Generated
859                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
860                                             The implied default value is 1,1024.
861
862     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
863                                             argument block size for the implicit arguments. This
864                                             varies by OS and language (for OpenCL see
865                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
866     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
867                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
868     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
869                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
870     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
871                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
872                                             CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
873                                             and the backend may not be able to satisfy the request. If
874                                             the specified range is incompatible with the function's
                                             "amdgpu-flat-work-group-size" value, the occupancy bounds
                                             implied by the workgroup size take precedence.
877
878     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
879                                             mode register to be set on entry. Overrides the default for
880                                             the calling convention.
881     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
882                                             the mode register to be set on entry. Overrides the default
883                                             for the calling convention.
884
885     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
886                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
887                                             attribute, or reached through a call site marked with this attribute,
888                                             the value returned by the intrinsic is undefined. The backend can
889                                             generally infer this during code generation, so typically there is no
890                                             benefit to frontends marking functions with this.
891
892     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
893                                             llvm.amdgcn.workitem.id.y intrinsic.
894
895     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
896                                             llvm.amdgcn.workitem.id.z intrinsic.
897
898     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
899                                             llvm.amdgcn.workgroup.id.x intrinsic.
900
901     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
902                                             llvm.amdgcn.workgroup.id.y intrinsic.
903
904     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
905                                             llvm.amdgcn.workgroup.id.z intrinsic.
906
907     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
908                                             llvm.amdgcn.dispatch.ptr intrinsic.
909
910     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
911                                             llvm.amdgcn.implicitarg.ptr intrinsic.
912
913     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
914                                             llvm.amdgcn.dispatch.id intrinsic.
915
916     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
917                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
918                                             attributes, the queue pointer may be required in situations where the
919                                             intrinsic call does not directly appear in the program. Some subtargets
                                             require the queue pointer to handle some addrspacecasts, as well
921                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
                                             llvm.debugtrap intrinsics.
923
924     ======================================= ==========================================================
925
926.. _amdgpu-elf-code-object:
927
928ELF Code Object
929===============
930
931The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
932can be linked by ``lld`` to produce a standard ELF shared code object which can
933be loaded and executed on an AMDGPU target.
934
935.. _amdgpu-elf-header:
936
937Header
938------
939
940The AMDGPU backend uses the following ELF header:
941
942  .. table:: AMDGPU ELF Header
943     :name: amdgpu-elf-header-table
944
945     ========================== ===============================
946     Field                      Value
947     ========================== ===============================
     ``e_ident[EI_CLASS]``      - ``ELFCLASS32``
                                - ``ELFCLASS64``
949     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
950     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
951                                - ``ELFOSABI_AMDGPU_HSA``
952                                - ``ELFOSABI_AMDGPU_PAL``
953                                - ``ELFOSABI_AMDGPU_MESA3D``
954     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
955                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
956                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
957                                - ``ELFABIVERSION_AMDGPU_PAL``
958                                - ``ELFABIVERSION_AMDGPU_MESA3D``
959     ``e_type``                 - ``ET_REL``
960                                - ``ET_DYN``
961     ``e_machine``              ``EM_AMDGPU``
962     ``e_entry``                0
963     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
964                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
965                                and :ref:`amdgpu-elf-header-e_flags-table-v4`
966     ========================== ===============================
967
968..
969
970  .. table:: AMDGPU ELF Header Enumeration Values
971     :name: amdgpu-elf-header-enumeration-values-table
972
973     =============================== =====
974     Name                            Value
975     =============================== =====
976     ``EM_AMDGPU``                   224
977     ``ELFOSABI_NONE``               0
978     ``ELFOSABI_AMDGPU_HSA``         64
979     ``ELFOSABI_AMDGPU_PAL``         65
980     ``ELFOSABI_AMDGPU_MESA3D``      66
981     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
982     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
983     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
984     ``ELFABIVERSION_AMDGPU_PAL``    0
985     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
986     =============================== =====
987
988``e_ident[EI_CLASS]``
989  The ELF class is:
990
991  * ``ELFCLASS32`` for ``r600`` architecture.
992
993  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
994    process address space applications.
995
996``e_ident[EI_DATA]``
997  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
998
999``e_ident[EI_OSABI]``
1000  One of the following AMDGPU target architecture specific OS ABIs
1001  (see :ref:`amdgpu-os`):
1002
1003  * ``ELFOSABI_NONE`` for *unknown* OS.
1004
1005  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1006
1007  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1008
  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1010
1011``e_ident[EI_ABIVERSION]``
1012  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1013  object conforms:
1014
1015  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1016    runtime ABI for code object V2. Specify using the Clang option
1017    ``-mcode-object-version=2``.
1018
1019  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1020    runtime ABI for code object V3. Specify using the Clang option
1021    ``-mcode-object-version=3``. This is the default code object
1022    version if not specified.
1023
1024  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1025    runtime ABI for code object V4. Specify using the Clang option
1026    ``-mcode-object-version=4``.
1027
1028  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1029    runtime ABI.
1030
1031  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1032    3D runtime ABI.
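
  The ABI version values overlap between OS ABIs
  (``ELFABIVERSION_AMDGPU_HSA_V2``, ``ELFABIVERSION_AMDGPU_PAL``, and
  ``ELFABIVERSION_AMDGPU_MESA3D`` are all 0), so ``e_ident[EI_ABIVERSION]`` can
  only be interpreted together with ``e_ident[EI_OSABI]``. A minimal sketch of
  that lookup follows, using the values from
  :ref:`amdgpu-elf-header-enumeration-values-table`. The function name
  ``amdgpu_hsa_code_object_version`` and the convention of returning 0 for
  anything that is not a recognized HSA code object are assumptions made for
  illustration; they are not part of any LLVM or runtime API.

  .. code:: c

    #include <stdint.h>

    /* Indices into e_ident, as defined by the ELF specification. */
    enum { EI_OSABI = 7, EI_ABIVERSION = 8 };

    /* Values from the AMDGPU ELF Header Enumeration Values table. */
    enum {
      ELFOSABI_AMDGPU_HSA = 64,
      ELFABIVERSION_AMDGPU_HSA_V2 = 0,
      ELFABIVERSION_AMDGPU_HSA_V3 = 1,
      ELFABIVERSION_AMDGPU_HSA_V4 = 2
    };

    /* Hypothetical helper: returns 2, 3, or 4 for an HSA code object, or 0 if
       e_ident does not describe a recognized HSA code object version. */
    static int amdgpu_hsa_code_object_version(const uint8_t *e_ident) {
      if (e_ident[EI_OSABI] != ELFOSABI_AMDGPU_HSA)
        return 0; /* ABI version 0 means something else for PAL and Mesa 3D. */
      switch (e_ident[EI_ABIVERSION]) {
      case ELFABIVERSION_AMDGPU_HSA_V2: return 2;
      case ELFABIVERSION_AMDGPU_HSA_V3: return 3;
      case ELFABIVERSION_AMDGPU_HSA_V4: return 4;
      default: return 0;
      }
    }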
1033
1034``e_type``
1035  Can be one of the following values:
1036
1037
1038  ``ET_REL``
    The type produced by the AMDGPU backend compiler as it is a relocatable
    code object.
1041
1042  ``ET_DYN``
1043    The type produced by the linker as it is a shared code object.
1044
  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1046
1047``e_machine``
1048  The value ``EM_AMDGPU`` is used for the machine for all processors supported
1049  by the ``r600`` and ``amdgcn`` architectures (see
1050  :ref:`amdgpu-processor-table`). The specific processor is specified in the
1051  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1052  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1053  ``e_flags`` for code object V3 to V4 (see
1054  :ref:`amdgpu-elf-header-e_flags-table-v3` and
1055  :ref:`amdgpu-elf-header-e_flags-table-v4`).
1056
1057``e_entry``
1058  The entry point is 0 as the entry points for individual kernels must be
1059  selected in order to invoke them through AQL packets.
1060
1061``e_flags``
1062  The AMDGPU backend uses the following ELF header flags:
1063
1064  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1065     :name: amdgpu-elf-header-e_flags-v2-table
1066
1067     ===================================== ===== =============================
1068     Name                                  Value Description
1069     ===================================== ===== =============================
1070     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1071                                                 target feature is
1072                                                 enabled for all code
1073                                                 contained in the code object.
1074                                                 If the processor
1075                                                 does not support the
1076                                                 ``xnack`` target
1077                                                 feature then must
1078                                                 be 0.
1079                                                 See
1080                                                 :ref:`amdgpu-target-features`.
1081     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1082                                                 handler is enabled for all
1083                                                 code contained in the code
1084                                                 object. If the processor
1085                                                 does not support a trap
1086                                                 handler then must be 0.
1087                                                 See
1088                                                 :ref:`amdgpu-target-features`.
1089     ===================================== ===== =============================
1090
1091  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1092     :name: amdgpu-elf-header-e_flags-table-v3
1093
1094     ================================= ===== =============================
1095     Name                              Value Description
1096     ================================= ===== =============================
1097     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1098                                             mask for
1099                                             ``EF_AMDGPU_MACH_xxx`` values
1100                                             defined in
1101                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
1102     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1103                                             target feature is
1104                                             enabled for all code
1105                                             contained in the code object.
1106                                             If the processor
1107                                             does not support the
1108                                             ``xnack`` target
1109                                             feature then must
1110                                             be 0.
1111                                             See
1112                                             :ref:`amdgpu-target-features`.
1113     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1114                                             target feature is
1115                                             enabled for all code
1116                                             contained in the code object.
1117                                             If the processor
1118                                             does not support the
1119                                             ``sramecc`` target
1120                                             feature then must
1121                                             be 0.
1122                                             See
1123                                             :ref:`amdgpu-target-features`.
1124     ================================= ===== =============================
1125
1126  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
1127     :name: amdgpu-elf-header-e_flags-table-v4
1128
1129     ============================================ ===== ===================================
     Name                                         Value Description
1131     ============================================ ===== ===================================
1132     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1133                                                        mask for
1134                                                        ``EF_AMDGPU_MACH_xxx`` values
1135                                                        defined in
1136                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
1137     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1138                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1139                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1141     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1142     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1143     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1144     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1145                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1146                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1150     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1151     ============================================ ===== ===================================
1152
1153  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1154     :name: amdgpu-ef-amdgpu-mach-table
1155
1156     ==================================== ========== =============================
1157     Name                                 Value      Description (see
1158                                                     :ref:`amdgpu-processor-table`)
1159     ==================================== ========== =============================
1160     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1161     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1162     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1163     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1164     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1165     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1166     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1167     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1168     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1169     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1170     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1171     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1172     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1173     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1174     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1175     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1176     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1177     *reserved*                           0x011 -    Reserved for ``r600``
1178                                          0x01f      architecture processors.
1179     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1180     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1181     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1182     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1183     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1184     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1185     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1186     *reserved*                           0x027      Reserved.
1187     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1188     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1189     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1190     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1191     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1192     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1193     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1194     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1195     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1196     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1197     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1198     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1199     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1200     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1201     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1202     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1203     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1204     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1205     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1206     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1207     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1208     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1209     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1210     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1211     *reserved*                           0x040      Reserved.
1212     *reserved*                           0x041      Reserved.
1213     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1214     *reserved*                           0x043      Reserved.
1215     *reserved*                           0x044      Reserved.
1216     *reserved*                           0x045      Reserved.
1217     ==================================== ========== =============================
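
  For code object V4, ``e_flags`` packs three independent fields, each
  extracted with its selection mask. The sketch below shows one way a tool
  might split them apart, using the mask and value names from
  :ref:`amdgpu-elf-header-e_flags-table-v4`. The ``amdgpu_eflags_v4`` struct and
  the ``amdgpu_decode_eflags_v4`` function are hypothetical names introduced
  only for this illustration.

  .. code:: c

    #include <stdint.h>

    /* Selection masks and values from the code object V4 e_flags table. */
    enum {
      EF_AMDGPU_MACH = 0x0ff,
      EF_AMDGPU_FEATURE_XNACK_V4 = 0x300,
      EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4 = 0x000,
      EF_AMDGPU_FEATURE_XNACK_ANY_V4 = 0x100,
      EF_AMDGPU_FEATURE_XNACK_OFF_V4 = 0x200,
      EF_AMDGPU_FEATURE_XNACK_ON_V4 = 0x300,
      EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xc00,
      EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4 = 0x000,
      EF_AMDGPU_FEATURE_SRAMECC_ANY_V4 = 0x400,
      EF_AMDGPU_FEATURE_SRAMECC_OFF_V4 = 0x800,
      EF_AMDGPU_FEATURE_SRAMECC_ON_V4 = 0xc00
    };

    /* Hypothetical container for the decoded fields. */
    struct amdgpu_eflags_v4 {
      uint32_t mach;    /* One of the EF_AMDGPU_MACH_xxx values. */
      uint32_t xnack;   /* One of the EF_AMDGPU_FEATURE_XNACK_*_V4 values. */
      uint32_t sramecc; /* One of the EF_AMDGPU_FEATURE_SRAMECC_*_V4 values. */
    };

    /* Split a code object V4 e_flags word into its three fields. */
    static struct amdgpu_eflags_v4 amdgpu_decode_eflags_v4(uint32_t e_flags) {
      struct amdgpu_eflags_v4 f;
      f.mach = e_flags & EF_AMDGPU_MACH;
      f.xnack = e_flags & EF_AMDGPU_FEATURE_XNACK_V4;
      f.sramecc = e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4;
      return f;
    }

  Under these definitions, an ``e_flags`` value of ``0x53f`` decodes to
  ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` (``0x03f``) with
  ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` and ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``.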
1218
1219Sections
1220--------
1221
1222An AMDGPU target ELF code object has the standard ELF sections which include:
1223
1224  .. table:: AMDGPU ELF Sections
1225     :name: amdgpu-elf-sections-table
1226
1227     ================== ================ =================================
1228     Name               Type             Attributes
1229     ================== ================ =================================
1230     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1231     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1232     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1233     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
     ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
1236     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1237     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1238     ``.note``          ``SHT_NOTE``     *none*
1239     ``.rela``\ *name*  ``SHT_RELA``     *none*
1240     ``.rela.dyn``      ``SHT_RELA``     *none*
1241     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1242     ``.shstrtab``      ``SHT_STRTAB``   *none*
1243     ``.strtab``        ``SHT_STRTAB``   *none*
1244     ``.symtab``        ``SHT_SYMTAB``   *none*
1245     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1246     ================== ================ =================================
1247
1248These sections have their standard meanings (see [ELF]_) and are only generated
1249if needed.
1250
``.debug_``\ *\**
1252  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1253  information on the DWARF produced by the AMDGPU backend.
1254
1255``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1256  The standard sections used by a dynamic loader.
1257
1258``.note``
1259  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1260  backend.
1261
1262``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
1265  relocation records associated with the ``.text`` section.
1266
1267  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each relocatable code object's ``.rela``\ *name* sections.
1269
1270  See :ref:`amdgpu-relocation-records` for the relocation records supported by
1271  the AMDGPU backend.
1272
1273``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.
1277
1278.. _amdgpu-note-records:
1279
1280Note Records
1281------------
1282
1283The AMDGPU backend code object contains ELF note records in the ``.note``
1284section. The set of generated notes and their semantics depend on the code
1285object version; see :ref:`amdgpu-note-records-v2` and
1286:ref:`amdgpu-note-records-v3-v4`.
1287
1288As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1289must be generated after the ``name`` field to ensure the ``desc`` field is 4
1290byte aligned. In addition, minimal zero-byte padding must be generated to
1291ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
1293alignment.
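
The padding rules amount to rounding both the ``name`` and ``desc`` sizes up to
a multiple of 4. The sketch below computes the resulting offsets. It assumes
the standard ELF note header of three 4-byte words (``n_namesz``,
``n_descsz``, ``n_type``) and that ``n_namesz`` counts the terminating NUL, as
the ELF specification defines; the helper names are illustrative only.

.. code:: c

  #include <stdint.h>

  /* Round n up to the next multiple of 4. */
  static uint32_t note_align4(uint32_t n) { return (n + 3u) & ~3u; }

  /* Offset of the desc field from the start of a note record. The note header
     holds three 4-byte words: n_namesz, n_descsz, and n_type. */
  static uint32_t note_desc_offset(uint32_t namesz) {
    return 12u + note_align4(namesz);
  }

  /* Total size of a note record, including the padding after name and desc. */
  static uint32_t note_record_size(uint32_t namesz, uint32_t descsz) {
    return note_desc_offset(namesz) + note_align4(descsz);
  }

With a ``name`` of "AMDGPU" (7 bytes including the terminating NUL), one byte
of padding is added and the ``desc`` field starts at offset 20. With a ``name``
of "AMD" (4 bytes including the NUL), no padding is needed and ``desc`` starts
at offset 16.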
1294
1295.. _amdgpu-note-records-v2:
1296
1297Code Object V2 Note Records
1298~~~~~~~~~~~~~~~~~~~~~~~~~~~
1299
1300.. warning::
1301  Code object V2 is not the default code object version emitted by
1302  this version of LLVM.
1303
The AMDGPU backend code object uses the following ELF note records in the
1305``.note`` section when compiling for code object V2.
1306
1307The note record vendor field is "AMD".
1308
1309Additional note records may be present, but any which are not documented here
1310are deprecated and should not be used.
1311
1312  .. table:: AMDGPU Code Object V2 ELF Note Records
1313     :name: amdgpu-elf-note-records-v2-table
1314
1315     ===== ===================================== ======================================
1316     Name  Type                                  Description
1317     ===== ===================================== ======================================
1318     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1319     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1320                                                 Finalizer and not the LLVM compiler.
1321     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1322     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1323                                                 YAML [YAML]_ textual format.
1324     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1325     ===== ===================================== ======================================
1326
1327..
1328
1329  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1330     :name: amdgpu-elf-note-record-enumeration-values-v2-table
1331
1332     ===================================== =====
1333     Name                                  Value
1334     ===================================== =====
1335     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1336     ``NT_AMD_HSA_HSAIL``                  2
1337     ``NT_AMD_HSA_ISA_VERSION``            3
1338     *reserved*                            4-9
1339     ``NT_AMD_HSA_METADATA``               10
1340     ``NT_AMD_HSA_ISA_NAME``               11
1341     ===================================== =====
1342
1343``NT_AMD_HSA_CODE_OBJECT_VERSION``
1344  Specifies the code object version number. The description field has the
1345  following layout:
1346
1347  .. code:: c
1348
1349    struct amdgpu_hsa_note_code_object_version_s {
1350      uint32_t major_version;
1351      uint32_t minor_version;
1352    };
1353
1354  The ``major_version`` has a value less than or equal to 2.
1355
1356``NT_AMD_HSA_HSAIL``
1357  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1358  field has the following layout:
1359
1360  .. code:: c
1361
1362    struct amdgpu_hsa_note_hsail_s {
1363      uint32_t hsail_major_version;
1364      uint32_t hsail_minor_version;
1365      uint8_t profile;
1366      uint8_t machine_model;
1367      uint8_t default_float_round;
1368    };
1369
1370``NT_AMD_HSA_ISA_VERSION``
1371  Specifies the target ISA version. The description field has the following layout:
1372
1373  .. code:: c
1374
1375    struct amdgpu_hsa_note_isa_s {
1376      uint16_t vendor_name_size;
1377      uint16_t architecture_name_size;
1378      uint32_t major;
1379      uint32_t minor;
1380      uint32_t stepping;
1381      char vendor_and_architecture_name[1];
1382    };
1383
  ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.
1390
1391  This note record is used by the HSA runtime loader.
1392
1393  Code object V2 only supports a limited number of processors and has fixed
1394  settings for target features. See
1395  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1396  processors and the corresponding target ID. In the table the note record ISA
1397  name is a concatenation of the vendor name, architecture name, major, minor,
1398  and stepping separated by a ":".
1399
1400  The target ID column shows the processor name and fixed target features used
1401  by the LLVM compiler. The LLVM compiler does not generate a
1402  ``NT_AMD_HSA_HSAIL`` note record.
1403
1404  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
1407  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1408  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1409  bit.
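
  As a rough illustration of how the description field maps to the note record
  ISA name, the sketch below decodes the fields and joins them with ":". It
  assumes the struct is laid out without padding (offsets 0, 2, 4, 8, and 12,
  with the names starting at offset 16) and reads integers as little-endian.
  The helper names ``rd16``, ``rd32``, and ``amdgpu_isa_name_from_note`` are
  hypothetical and do not correspond to any LLVM or HSA runtime API.

  .. code:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Read little-endian integers; all AMDGPU targets use ELFDATA2LSB. */
    static uint32_t rd16(const uint8_t *p) { return p[0] | (uint32_t)p[1] << 8; }
    static uint32_t rd32(const uint8_t *p) { return rd16(p) | rd16(p + 2) << 16; }

    /* Hypothetical helper: formats the ISA name described by an
       NT_AMD_HSA_ISA_VERSION desc field into out. Returns 0 on success, or -1
       if desc is malformed or out is too small. */
    static int amdgpu_isa_name_from_note(const uint8_t *desc, size_t desc_size,
                                         char *out, size_t out_size) {
      if (desc_size < 16)
        return -1;
      size_t vendor_size = rd16(desc);
      size_t arch_size = rd16(desc + 2);
      if (vendor_size == 0 || arch_size == 0 ||
          desc_size - 16 < vendor_size + arch_size)
        return -1;
      const char *vendor = (const char *)desc + 16;
      const char *arch = vendor + vendor_size;
      /* Both sizes include the NUL terminator, so check that it is present. */
      if (vendor[vendor_size - 1] != '\0' || arch[arch_size - 1] != '\0')
        return -1;
      int n = snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch,
                       (unsigned)rd32(desc + 4), (unsigned)rd32(desc + 8),
                       (unsigned)rd32(desc + 12));
      return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
    }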
1410
1411``NT_AMD_HSA_ISA_NAME``
1412  Specifies the target ISA name as a non-NUL terminated string.
1413
1414  This note record is not used by the HSA runtime loader.
1415
  See the ``NT_AMD_HSA_ISA_VERSION`` note record for a description of code
  object V2's limited support of processors and fixed settings for target
  features.
1418
1419  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1420  from the string to the corresponding target ID. If the ``xnack`` target
1421  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not do the appending and
1423  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1424
1425``NT_AMD_HSA_METADATA``
1426  Specifies extensible metadata associated with the code objects executed on HSA
1427  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1428  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1429  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1430  metadata string.
1431
1432  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1433     :name: amdgpu-elf-note-record-supported_processors-v2-table
1434
1435     ===================== ==========================
1436     Note Record ISA Name  Target ID
1437     ===================== ==========================
1438     ``AMD:AMDGPU:6:0:0``  ``gfx600``
1439     ``AMD:AMDGPU:6:0:1``  ``gfx601``
1440     ``AMD:AMDGPU:6:0:2``  ``gfx602``
1441     ``AMD:AMDGPU:7:0:0``  ``gfx700``
1442     ``AMD:AMDGPU:7:0:1``  ``gfx701``
1443     ``AMD:AMDGPU:7:0:2``  ``gfx702``
1444     ``AMD:AMDGPU:7:0:3``  ``gfx703``
1445     ``AMD:AMDGPU:7:0:4``  ``gfx704``
1446     ``AMD:AMDGPU:7:0:5``  ``gfx705``
1447     ``AMD:AMDGPU:8:0:0``  ``gfx802``
1448     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1449     ``AMD:AMDGPU:8:0:2``  ``gfx802``
1450     ``AMD:AMDGPU:8:0:3``  ``gfx803``
1451     ``AMD:AMDGPU:8:0:4``  ``gfx803``
1452     ``AMD:AMDGPU:8:0:5``  ``gfx805``
1453     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1454     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1455     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1456     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1457     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1458     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1459     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1460     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1461     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1462     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1463     ===================== ==========================
1464
1465.. _amdgpu-note-records-v3-v4:
1466
1467Code Object V3 to V4 Note Records
1468~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1469
1470The AMDGPU backend code object uses the following ELF note record in the
1471``.note`` section when compiling for code object V3 to V4.
1472
1473The note record vendor field is "AMDGPU".
1474
1475Additional note records may be present, but any which are not documented here
1476are deprecated and should not be used.
1477
1478  .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
1479     :name: amdgpu-elf-note-records-table-v3-v4
1480
1481     ======== ============================== ======================================
1482     Name     Type                           Description
1483     ======== ============================== ======================================
1484     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1485                                             binary format.
1486     ======== ============================== ======================================
1487
1488..
1489
1490  .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
1491     :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4
1492
1493     ============================== =====
1494     Name                           Value
1495     ============================== =====
1496     *reserved*                     0-31
1497     ``NT_AMDGPU_METADATA``         32
1498     ============================== =====
1499
1500``NT_AMDGPU_METADATA``
1501  Specifies extensible metadata associated with an AMDGPU code object. It is
1502  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1503  :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
1504  :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
1505  ``amdhsa`` OS.
1506
1507.. _amdgpu-symbols:
1508
1509Symbols
1510-------
1511
1512Symbols include the following:
1513
1514  .. table:: AMDGPU ELF Symbols
1515     :name: amdgpu-elf-symbols-table
1516
1517     ===================== ================== ================ ==================
1518     Name                  Type               Section          Description
1519     ===================== ================== ================ ==================
1520     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1521                                              - ``.rodata``
1522                                              - ``.bss``
1523     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1524     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1525     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1526     ===================== ================== ================ ==================
1527
1528Global variable
1529  Global variables both used and defined by the compilation unit.
1530
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
1533
  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1535  will resolve relocations using the definition provided by another code object
1536  or explicitly defined by the runtime.
1537
1538  If the symbol resides in local/group memory (LDS) then its section is the
1539  special processor specific section name ``SHN_AMDGPU_LDS``, and the
1540  ``st_value`` field describes alignment requirements as it does for common
1541  symbols.
1542
1543  .. TODO::
1544
1545     Add description of linked shared object symbols. Seems undefined symbols
1546     are marked as STT_NOTYPE.
1547
1548Kernel descriptor
1549  Every HSA kernel has an associated kernel descriptor. It is the address of the
1550  kernel descriptor that is used in the AQL dispatch packet used to invoke the
1551  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1552  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
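
  Because the dispatch packet needs the descriptor rather than the entry
  point, a tool that looks up a kernel by name would search for the symbol
  with the ``.kd`` suffix shown in :ref:`amdgpu-elf-symbols-table`. A minimal
  sketch of forming that name follows; the function name
  ``amdgpu_kernel_descriptor_symbol`` is hypothetical.

  .. code:: c

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical helper: writes the kernel descriptor symbol name for the
       kernel with the given link name into out. Returns 0 on success, or -1
       if out is too small. */
    static int amdgpu_kernel_descriptor_symbol(const char *link_name, char *out,
                                               size_t out_size) {
      int n = snprintf(out, out_size, "%s.kd", link_name);
      return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
    }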
1553
1554Kernel entry point
1555  Every HSA kernel also has a symbol for its machine code entry point.
1556
1557.. _amdgpu-relocation-records:
1558
1559Relocation Records
1560------------------
1561
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1563relocatable fields are:
1564
1565``word32``
1566  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1567  alignment. These values use the same byte order as other word values in the
1568  AMDGPU architecture.
1569
1570``word64``
1571  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1572  alignment. These values use the same byte order as other word values in the
1573  AMDGPU architecture.
1574
The following notations are used for specifying relocation calculations:
1576
1577**A**
1578  Represents the addend used to compute the value of the relocatable field.
1579
1580**G**
1581  Represents the offset into the global offset table at which the relocation
1582  entry's symbol will reside during execution.
1583
1584**GOT**
1585  Represents the address of the global offset table.
1586
1587**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1589  of the storage unit being relocated (computed using ``r_offset``).
1590
1591**S**
1592  Represents the value of the symbol whose index resides in the relocation
1593  entry. Relocations not using this must specify a symbol index of
1594  ``STN_UNDEF``.
1595
1596**B**
1597  Represents the base address of a loaded executable or shared object which is
1598  the difference between the ELF address and the actual load address.
1599  Relocations using this are only valid in executable or shared objects.
1600
1601The following relocation types are supported:
1602
1603  .. table:: AMDGPU ELF Relocation Records
1604     :name: amdgpu-elf-relocation-records-table
1605
1606     ========================== ======= =====  ==========  ==============================
1607     Relocation Type            Kind    Value  Field       Calculation
1608     ========================== ======= =====  ==========  ==============================
1609     ``R_AMDGPU_NONE``                  0      *none*      *none*
1610     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1611                                Dynamic
1612     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1613                                Dynamic
1614     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1615                                Dynamic
1616     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1617     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1618     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1619                                Dynamic
1620     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1621     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1622     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1623     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1624     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1625     *reserved*                         12
1626     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1627     ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1628     ========================== ======= =====  ==========  ==============================
1629
1630``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1631the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1632
1633There is no current OS loader support for 32-bit programs and so
1634``R_AMDGPU_ABS32`` is not used.
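
The calculations in the table can be sketched in Python as follows. This is only
an illustration: the function name ``relocation_value`` is hypothetical, and
wrapping each result to the field width is an assumption of the sketch (a real
linker or loader would typically diagnose overflow instead).

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocation_value(reloc_type, S=0, A=0, P=0, G=0, GOT=0, B=0):
    """Evaluate the Calculation column for one relocation type.

    S, A, P, G, GOT and B follow the notation defined above. Results are
    wrapped to the field width, which is an assumption of this sketch.
    """
    rel = (S + A - P) & MASK64            # place-relative value, 64-bit wrap
    got_rel = (G + GOT + A - P) & MASK64  # GOT-entry-relative value
    table = {
        "R_AMDGPU_ABS32_LO": (S + A) & MASK32,
        "R_AMDGPU_ABS32_HI": ((S + A) & MASK64) >> 32,
        "R_AMDGPU_ABS64": (S + A) & MASK64,
        "R_AMDGPU_REL32": rel & MASK32,
        "R_AMDGPU_REL64": rel,
        "R_AMDGPU_ABS32": (S + A) & MASK32,
        "R_AMDGPU_GOTPCREL": got_rel & MASK32,
        "R_AMDGPU_GOTPCREL32_LO": got_rel & MASK32,
        "R_AMDGPU_GOTPCREL32_HI": got_rel >> 32,
        "R_AMDGPU_REL32_LO": rel & MASK32,
        "R_AMDGPU_REL32_HI": rel >> 32,
        "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
        # Assumes the byte distance is a multiple of 4, so the integer
        # division is exact; the result is a signed 16-bit value.
        "R_AMDGPU_REL16": (((S + A - P) - 4) // 4) & MASK16,
    }
    return table[reloc_type]
```

For instance, with ``S = 0x100002000``, ``A = 4`` and ``P = 0x1000``, the
``R_AMDGPU_REL32_LO`` and ``R_AMDGPU_REL32_HI`` pair yields the low and high
halves of ``0x100001004``.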
1635
1636.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1637
1638Loaded Code Object Path Uniform Resource Identifier (URI)
1639---------------------------------------------------------
1640
1641The AMD GPU code object loader represents the path of the ELF shared object from
1642which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded, and relocated form of the ELF
1644shared object.  Multiple code objects may be loaded at different memory
1645addresses in the same process from the same ELF shared object.
1646
1647The loaded code object path URI syntax is defined by the following BNF syntax:
1648
1649.. code::
1650
  code_object_uri ::= file_uri | memory_uri
  file_uri        ::= "file://" file_path [ range_specifier ]
  memory_uri      ::= "memory://" process_id range_specifier
  range_specifier ::= [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::= URI_ENCODED_OS_FILE_PATH
  process_id      ::= DECIMAL_NUMBER
  number          ::= HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1658
1659**number**
1660  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1661  and octal values by "0".
1662
1663**file_path**
1664  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1665  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
1667  the path are separated by "/".
1668
1669**offset**
1670  Is a 0-based byte offset to the start of the code object.  For a file URI, it
1671  is from the start of the file specified by the ``file_path``, and if omitted
1672  defaults to 0. For a memory URI, it is the memory address and is required.
1673
1674**size**
1675  Is the number of bytes in the code object.  For a file URI, if omitted it
1676  defaults to the size of the file.  It is required for a memory URI.
1677
1678**process_id**
1679  Is the identity of the process owning the memory.  For Linux it is the C
1680  unsigned integral decimal literal for the process ID (PID).
1681
1682For example:
1683
1684.. code::
1685
1686  file:///dir1/dir2/file1
1687  file:///dir3/dir4/file2#offset=0x2000&size=3000
1688  memory://1234#offset=0x20000&size=3000
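
A consumer of these URIs might parse them along the following lines. This is a
hedged sketch rather than the loader's implementation: the names
``parse_number`` and ``parse_code_object_uri`` and the dictionary layout are
hypothetical, and the sketch assumes that a range specifier always starts with
``#`` or ``?``.

```python
import re
from urllib.parse import unquote

# C-style integer literal: hexadecimal, decimal or octal.
_NUM = r"(?:0[xX][0-9a-fA-F]+|[0-9]+)"
_RANGE = rf"[#?]offset=(?P<offset>{_NUM})&size=(?P<size>{_NUM})"
_FILE_RE = re.compile(rf"file://(?P<path>[^#?]*)(?:{_RANGE})?")
_MEMORY_RE = re.compile(rf"memory://(?P<pid>[0-9]+){_RANGE}")


def parse_number(text):
    """Parse a C integral literal ("0x" hex, leading "0" octal, else decimal)."""
    if text[:2].lower() == "0x":
        return int(text, 16)
    if len(text) > 1 and text[0] == "0":
        return int(text, 8)
    return int(text, 10)


def parse_code_object_uri(uri):
    """Split a loaded code object URI into its components."""
    m = _FILE_RE.fullmatch(uri)
    if m:
        return {
            "kind": "file",
            "path": unquote(m["path"]),  # undo the URI (percent) encoding
            # Offset defaults to 0; a size of None stands for "size of file".
            "offset": parse_number(m["offset"]) if m["offset"] else 0,
            "size": parse_number(m["size"]) if m["size"] else None,
        }
    m = _MEMORY_RE.fullmatch(uri)
    if m:
        return {
            "kind": "memory",
            "process_id": int(m["pid"]),
            "offset": parse_number(m["offset"]),  # memory address
            "size": parse_number(m["size"]),
        }
    raise ValueError(f"not a code object URI: {uri!r}")
```

Applied to the examples above, the second URI yields the path
``/dir3/dir4/file2`` with offset ``0x2000`` and size ``3000``.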
1689
1690.. _amdgpu-dwarf-debug-information:
1691
1692DWARF Debug Information
1693=======================
1694
1695.. warning::
1696
1697   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1698   is not currently fully implemented and is subject to change.
1699
1700AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1701:ref:`amdgpu-elf-code-object`) which contain information that maps the code
1702object executable code and data to the source language constructs. It can be
1703used by tools such as debuggers and profilers. It uses features defined in
1704:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1705DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1706
1707This section defines the AMDGPU target architecture specific DWARF mappings.
1708
1709.. _amdgpu-dwarf-register-identifier:
1710
1711Register Identifier
1712-------------------
1713
1714This section defines the AMDGPU target architecture register numbers used in
1715DWARF operation expressions (see DWARF Version 5 section 2.5 and
1716:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1717instructions (see DWARF Version 5 section 6.4 and
1718:ref:`amdgpu-dwarf-call-frame-information`).
1719
1720A single code object can contain code for kernels that have different wavefront
1721sizes. The vector registers and some scalar registers are based on the wavefront
1722size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1723simplifies the consumer of the DWARF so that each register has a fixed size,
1724rather than being dynamic according to the wavefront size mode. Similarly,
1725distinct DWARF registers are defined for those registers that vary in size
1726according to the process address size. This allows a consumer to treat a
1727specific AMDGPU processor as a single architecture regardless of how it is
1728configured at run time. The compiler explicitly specifies the DWARF registers
1729that match the mode in which the code it is generating will be executed.
1730
1731DWARF registers are encoded as numbers, which are mapped to architecture
1732registers. The mapping for AMDGPU is defined in
1733:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1734mapping.
1735
1736.. table:: AMDGPU DWARF Register Mapping
1737   :name: amdgpu-dwarf-register-mapping-table
1738
1739   ============== ================= ======== ==================================
1740   DWARF Register AMDGPU Register   Bit Size Description
1741   ============== ================= ======== ==================================
1742   0              PC_32             32       Program Counter (PC) when
1743                                             executing in a 32-bit process
1744                                             address space. Used in the CFI to
1745                                             describe the PC of the calling
1746                                             frame.
1747   1              EXEC_MASK_32      32       Execution Mask Register when
1748                                             executing in wavefront 32 mode.
1749   2-15           *Reserved*                 *Reserved for highly accessed
1750                                             registers using DWARF shortcut.*
1751   16             PC_64             64       Program Counter (PC) when
1752                                             executing in a 64-bit process
1753                                             address space. Used in the CFI to
1754                                             describe the PC of the calling
1755                                             frame.
1756   17             EXEC_MASK_64      64       Execution Mask Register when
1757                                             executing in wavefront 64 mode.
1758   18-31          *Reserved*                 *Reserved for highly accessed
1759                                             registers using DWARF shortcut.*
1760   32-95          SGPR0-SGPR63      32       Scalar General Purpose
1761                                             Registers.
1762   96-127         *Reserved*                 *Reserved for frequently accessed
1763                                             registers using DWARF 1-byte ULEB.*
1764   128            STATUS            32       Status Register.
1765   129-511        *Reserved*                 *Reserved for future Scalar
1766                                             Architectural Registers.*
1767   512            VCC_32            32       Vector Condition Code Register
1768                                             when executing in wavefront 32
1769                                             mode.
1770   513-767        *Reserved*                 *Reserved for future Vector
1771                                             Architectural Registers when
1772                                             executing in wavefront 32 mode.*
1773   768            VCC_64            64       Vector Condition Code Register
1774                                             when executing in wavefront 64
1775                                             mode.
1776   769-1023       *Reserved*                 *Reserved for future Vector
1777                                             Architectural Registers when
1778                                             executing in wavefront 64 mode.*
1779   1024-1087      *Reserved*                 *Reserved for padding.*
1780   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1781   1130-1535      *Reserved*                 *Reserved for future Scalar
1782                                             General Purpose Registers.*
1783   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1784                                             when executing in wavefront 32
1785                                             mode.
1786   1792-2047      *Reserved*                 *Reserved for future Vector
1787                                             General Purpose Registers when
1788                                             executing in wavefront 32 mode.*
1789   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1790                                             when executing in wavefront 32
1791                                             mode.
1792   2304-2559      *Reserved*                 *Reserved for future Vector
1793                                             Accumulation Registers when
1794                                             executing in wavefront 32 mode.*
1795   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1796                                             when executing in wavefront 64
1797                                             mode.
1798   2816-3071      *Reserved*                 *Reserved for future Vector
1799                                             General Purpose Registers when
1800                                             executing in wavefront 64 mode.*
1801   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1802                                             when executing in wavefront 64
1803                                             mode.
1804   3328-3583      *Reserved*                 *Reserved for future Vector
1805                                             Accumulation Registers when
1806                                             executing in wavefront 64 mode.*
1807   ============== ================= ======== ==================================
1808
1809The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits each), one per lane, with the dword at
1811the least significant bit position corresponding to lane 0 and so forth. DWARF
1812location expressions involving the ``DW_OP_LLVM_offset`` and
1813``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1814register corresponding to the lane that is executing the current thread of
1815execution in languages that are implemented using a SIMD or SIMT execution
1816model.
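
The lane layout can be illustrated with a short Python sketch that models a
vector register as one wide integer. The helper names ``lane_byte_offset`` and
``lane_dword`` are hypothetical; the byte offset is the kind of value a location
expression would combine with the lane identifier to select a lane's dword.

```python
def lane_byte_offset(lane):
    """Byte offset of a lane's dword within the full vector register."""
    return 4 * lane


def lane_dword(register_value, lane):
    """Extract the 32-bit dword belonging to `lane`.

    `register_value` models the whole register as an integer whose least
    significant dword is lane 0.
    """
    return (register_value >> (32 * lane)) & 0xFFFFFFFF


# Build a wavefront 32 register in which lane i holds the value 0x100 + i.
register = sum((0x100 + lane) << (32 * lane) for lane in range(32))
```

Because bytes are stored least significant first, reading 4 bytes at
``lane_byte_offset(lane)`` from the little-endian byte image of ``register``
gives the same dword as ``lane_dword(register, lane)``.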
1817
1818If the wavefront size is 32 lanes then the wavefront 32 mode register
1819definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1820mode register definitions are used. Some AMDGPU targets support executing in
1821both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1822to the wavefront mode of the generated code will be used.
1823
1824If code is generated to execute in a 32-bit process address space, then the
182532-bit process address space register definitions are used. If code is generated
1826to execute in a 64-bit process address space, then the 64-bit process address
1827space register definitions are used. The ``amdgcn`` target only supports the
182864-bit process address space.
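
As an illustration of how a producer or consumer could select register numbers
from the mapping table, consider the following Python sketch. The function names
are hypothetical; only the numeric ranges come from
:ref:`amdgpu-dwarf-register-mapping-table`.

```python
def dwarf_pc(address_bits):
    """DWARF register number of the PC for a 32-bit or 64-bit address space."""
    return {32: 0, 64: 16}[address_bits]


def dwarf_exec_mask(wavefront_size):
    """DWARF register number of EXEC_MASK_32 or EXEC_MASK_64."""
    return {32: 1, 64: 17}[wavefront_size]


def dwarf_vcc(wavefront_size):
    """DWARF register number of VCC_32 or VCC_64."""
    return {32: 512, 64: 768}[wavefront_size]


def dwarf_sgpr(n):
    """DWARF register number of SGPRn; the SGPRs occupy two separate ranges."""
    if 0 <= n <= 63:
        return 32 + n
    if 64 <= n <= 105:
        return 1088 + (n - 64)
    raise ValueError(f"no DWARF register for SGPR{n}")


def dwarf_vgpr(n, wavefront_size):
    """DWARF register number of VGPRn in the given wavefront mode."""
    if not 0 <= n <= 255:
        raise ValueError(f"no DWARF register for VGPR{n}")
    return {32: 1536, 64: 2560}[wavefront_size] + n


def dwarf_agpr(n, wavefront_size):
    """DWARF register number of AGPRn in the given wavefront mode."""
    if not 0 <= n <= 255:
        raise ValueError(f"no DWARF register for AGPR{n}")
    return {32: 2048, 64: 3072}[wavefront_size] + n
```

For example, code generated for wavefront 64 mode would refer to VGPR3 as DWARF
register ``dwarf_vgpr(3, 64)``, which is 2563.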
1829
1830.. _amdgpu-dwarf-address-class-identifier:
1831
1832Address Class Identifier
1833------------------------
1834
1835The DWARF address class represents the source language memory space. See DWARF
1836Version 5 section 2.12 which is updated by the *DWARF Extensions For
1837Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1838
1839The DWARF address class mapping used for AMDGPU is defined in
1840:ref:`amdgpu-dwarf-address-class-mapping-table`.
1841
1842.. table:: AMDGPU DWARF Address Class Mapping
1843   :name: amdgpu-dwarf-address-class-mapping-table
1844
1845   ========================= ====== =================
1846   DWARF                            AMDGPU
1847   -------------------------------- -----------------
1848   Address Class Name        Value  Address Space
1849   ========================= ====== =================
1850   ``DW_ADDR_none``          0x0000 Generic (Flat)
1851   ``DW_ADDR_LLVM_global``   0x0001 Global
1852   ``DW_ADDR_LLVM_constant`` 0x0002 Global
1853   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
1854   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
1855   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
1856   ========================= ====== =================
1857
1858The DWARF address class values defined in the *DWARF Extensions For
1859Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
1860
1861In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is
1862available for use for the AMD extension for access to the hardware GDS memory
1863which is scratchpad memory allocated per device.
1864
1865For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1866address class of ``DW_ADDR_none`` is used.
1867
1868See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1869mapping of DWARF address classes to DWARF address spaces, including address size
1870and NULL value.
1871
1872.. _amdgpu-dwarf-address-space-identifier:
1873
1874Address Space Identifier
1875------------------------
1876
1877DWARF address spaces correspond to target architecture specific linear
1878addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1879For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1880
1881The DWARF address space mapping used for AMDGPU is defined in
1882:ref:`amdgpu-dwarf-address-space-mapping-table`.
1883
1884.. table:: AMDGPU DWARF Address Space Mapping
1885   :name: amdgpu-dwarf-address-space-mapping-table
1886
1887   ======================================= ===== ======= ======== ================= =======================
1888   DWARF                                                          AMDGPU            Notes
1889   --------------------------------------- ----- ---------------- ----------------- -----------------------
1890   Address Space Name                      Value Address Bit Size Address Space
1891   --------------------------------------- ----- ------- -------- ----------------- -----------------------
1892   ..                                            64-bit  32-bit
1893                                                 process process
1894                                                 address address
1895                                                 space   space
1896   ======================================= ===== ======= ======== ================= =======================
1897   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
1898   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
1899   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
1900   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
1901   *Reserved*                              0x04
1902   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
1903   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
1904   ======================================= ===== ======= ======== ================= =======================
1905
1906See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1907including address size and NULL value.
1908
1909The ``DW_ASPACE_none`` address space is the default target architecture address
1910space used in DWARF operations that do not specify an address space. It
1911therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1912related operations can refer to addresses in the program code.
1913
1914The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1915specify the flat address space. If the address corresponds to an address in the
1916local address space, then it corresponds to the wavefront that is executing the
1917focused thread of execution. If the address corresponds to an address in the
1918private address space, then it corresponds to the lane that is executing the
1919focused thread of execution for languages that are implemented using a SIMD or
1920SIMT execution model.
1921
1922.. note::
1923
1924  CUDA-like languages such as HIP that do not have address spaces in the
1925  language type system, but do allow variables to be allocated in different
1926  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1927  address space in the DWARF expression operations as the default address space
1928  is the global address space.
1929
1930The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
1931specify the local address space corresponding to the wavefront that is executing
1932the focused thread of execution.
1933
1934The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
1935to specify the private address space corresponding to the lane that is executing
1936the focused thread of execution for languages that are implemented using a SIMD
1937or SIMT execution model.
1938
1939The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
1940to specify the unswizzled private address space corresponding to the wavefront
1941that is executing the focused thread of execution. The wavefront view of private
1942memory is the per wavefront unswizzled backing memory layout defined in
1943:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
1944location for the backing memory of the wavefront (namely the address is not
1945offset by ``wavefront-scratch-base``). The following formula can be used to
1946convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
1947``DW_ASPACE_AMDGPU_private_wave`` address:
1948
1949::
1950
1951  private-address-wavefront =
1952    ((private-address-lane / 4) * wavefront-size * 4) +
1953    (wavefront-lane-id * 4) + (private-address-lane % 4)
1954
1955If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
1956of the dwords for each lane starting with lane 0 is required, then this
1957simplifies to:
1958
1959::
1960
1961  private-address-wavefront =
1962    private-address-lane * wavefront-size
1963
1964A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
1965complete spilled vector register back into a complete vector register in the
1966CFI. The frame pointer can be a private lane address which is dword aligned,
1967which can be shifted to multiply by the wavefront size, and then used to form a
1968private wavefront address that gives a location for a contiguous set of dwords,
1969one per lane, where the vector register dwords are spilled. The compiler knows
1970the wavefront size since it generates the code. Note that the type of the
1971address may have to be converted as the size of a
1972``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
1973``DW_ASPACE_AMDGPU_private_wave`` address.
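
The conversion formula can be checked with a small Python sketch. The function
name ``private_lane_to_wave`` is hypothetical; the arithmetic is the formula
given above.

```python
def private_lane_to_wave(private_address_lane, wavefront_lane_id, wavefront_size):
    """Convert a private lane address to an unswizzled private wave address.

    Each dword of lane-private memory is interleaved across the lanes of the
    wavefront, so dword index k of lane L lives at dword k * wavefront_size + L.
    """
    dword_index = private_address_lane // 4
    byte_in_dword = private_address_lane % 4
    return (dword_index * wavefront_size * 4
            + wavefront_lane_id * 4
            + byte_in_dword)
```

For a dword-aligned lane address and lane 0, the result equals
``private_address_lane * wavefront_size``, which is the simplified form used to
locate a spilled vector register.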
1974
1975.. _amdgpu-dwarf-lane-identifier:
1976
Lane Identifier
1978---------------
1979
DWARF lane identifiers specify a target architecture lane position for hardware
1981that executes in a SIMD or SIMT manner, and on which a source language maps its
1982threads of execution onto those lanes. The DWARF lane identifier is pushed by
1983the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
1984section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
1985section :ref:`amdgpu-dwarf-operation-expressions`.
1986
1987For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
1988wavefront. It is numbered from 0 to the wavefront size minus 1.
1989
1990Operation Expressions
1991---------------------
1992
1993DWARF expressions are used to compute program values and the locations of
1994program objects. See DWARF Version 5 section 2.5 and
1995:ref:`amdgpu-dwarf-operation-expressions`.
1996
1997DWARF location descriptions describe how to access storage which includes memory
1998and registers. When accessing storage on AMDGPU, bytes are ordered with least
1999significant bytes first, and bits are ordered within bytes with least
2000significant bits first.
2001
2002For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2003unwinding vector registers that are spilled under the execution mask to memory:
2004the zero-single location description is the vector register, and the one-single
2005location description is the spilled memory location description. The
2006``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2007memory location description.
2008
2009In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2010``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2011controlled by the execution mask. An undefined location description together
2012with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2013to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2014
2015Debugger Information Entry Attributes
2016-------------------------------------
2017
2018This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1,
2020which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2021:ref:`amdgpu-dwarf-low-level-information` and
2022:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2023
2024.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2025
2026``DW_AT_LLVM_lane_pc``
2027~~~~~~~~~~~~~~~~~~~~~~
2028
2029For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2030location of the separate lanes of a SIMT thread.
2031
2032If the lane is an active lane then this will be the same as the current program
2033location.
2034
2035If the lane is inactive, but was active on entry to the subprogram, then this is
2036the program location in the subprogram at which execution of the lane is
conceptually positioned.
2038
2039If the lane was not active on entry to the subprogram, then this will be the
2040undefined location. A client debugger can check if the lane is part of a valid
2041work-group by checking that the lane is in the range of the associated
2042work-group within the grid, accounting for partial work-groups. If it is not,
2043then the debugger can omit any information for the lane. Otherwise, the debugger
2044may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2045calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
2047``DW_AT_LLVM_lane_pc``.
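
The unwinding search described above might look as follows in a debugger. This
is a sketch under stated assumptions: ``frames_lane_pcs`` is a hypothetical list
with one entry per call frame, innermost first, where each entry is the
already-evaluated ``DW_AT_LLVM_lane_pc`` vector for that frame and ``None``
stands for the undefined location.

```python
def innermost_frame_with_lane_pc(frames_lane_pcs, lane):
    """Return (frame depth, program location) of the innermost frame in which
    `lane` has a defined lane PC, or None if no frame has one."""
    for depth, lane_pcs in enumerate(frames_lane_pcs):
        pc = lane_pcs[lane]
        if pc is not None:
            return depth, pc
    return None


# Two lanes, three frames: lane 1 was not active on entry to the innermost
# subprogram, so its innermost call frame is the caller.
frames = [
    [0x40, None],
    [0x80, 0x84],
    [0xC0, 0xC4],
]
```

Here lane 0 is reported in the innermost frame, while lane 1 is reported
starting from the calling frame at depth 1.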
2048
2049The following example illustrates how the AMDGPU backend can generate a DWARF
2050location list expression for the nested ``IF/THEN/ELSE`` structures of the
2051following subprogram pseudo code for a target with 64 lanes per wavefront.
2052
2053.. code::
2054  :number-lines:
2055
2056  SUBPROGRAM X
2057  BEGIN
2058    a;
2059    IF (c1) THEN
2060      b;
2061      IF (c2) THEN
2062        c;
2063      ELSE
2064        d;
2065      ENDIF
2066      e;
2067    ELSE
2068      f;
2069    ENDIF
2070    g;
2071  END
2072
2073The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2074execution mask (``EXEC``) to linearize the control flow. The condition is
2075evaluated to make a mask of the lanes for which the condition evaluates to true.
2076First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2077logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE``
2080region the ``EXEC`` mask is restored to the value it had at the beginning of the
2081region. This is shown below. Other approaches are possible, but the basic
2082concept is the same.
2083
2084.. code::
2085  :number-lines:
2086
2087  $lex_start:
2088    a;
2089    %1 = EXEC
2090    %2 = c1
2091  $lex_1_start:
2092    EXEC = %1 & %2
  $lex_1_then:
2094      b;
2095      %3 = EXEC
2096      %4 = c2
2097  $lex_1_1_start:
2098      EXEC = %3 & %4
2099  $lex_1_1_then:
2100        c;
2101      EXEC = ~EXEC & %3
2102  $lex_1_1_else:
2103        d;
2104      EXEC = %3
2105  $lex_1_1_end:
2106      e;
2107    EXEC = ~EXEC & %1
2108  $lex_1_else:
2109      f;
2110    EXEC = %1
2111  $lex_1_end:
2112    g;
2113  $lex_end:
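
The effect of this linearization on the execution mask can be traced with a
small Python simulation. It is only a model of the pseudo MIR above: the
function name ``linearized_exec_masks`` is hypothetical, and the conditions
``c1`` and ``c2`` are supplied as precomputed per-lane bit masks.

```python
LANES = 64
ALL_LANES = (1 << LANES) - 1


def linearized_exec_masks(exec_on_entry, c1, c2):
    """Return the EXEC mask in effect while each statement a..g executes."""
    masks = {}
    exec_mask = exec_on_entry & ALL_LANES
    masks["a"] = exec_mask
    saved_1 = exec_mask               # %1 = EXEC
    exec_mask = saved_1 & c1          # EXEC = %1 & %2
    masks["b"] = exec_mask
    saved_3 = exec_mask               # %3 = EXEC
    exec_mask = saved_3 & c2          # EXEC = %3 & %4
    masks["c"] = exec_mask
    exec_mask = ~exec_mask & saved_3  # EXEC = ~EXEC & %3
    masks["d"] = exec_mask
    exec_mask = saved_3               # EXEC = %3
    masks["e"] = exec_mask
    exec_mask = ~exec_mask & saved_1  # EXEC = ~EXEC & %1
    masks["f"] = exec_mask
    exec_mask = saved_1               # EXEC = %1
    masks["g"] = exec_mask
    return masks
```

With four lanes active on entry (``0b1111``), ``c1 = 0b0011`` and
``c2 = 0b0101``, statements ``c`` and ``d`` run with disjoint masks whose union
is the mask for ``b``, and ``g`` runs with the entry mask again.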
2114
2115To create the DWARF location list expression that defines the location
2116description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2117pseudo instruction can be used to annotate the linearized control flow. This can
2118be done by defining an artificial variable for the lane PC. The DWARF location
2119list expression created for it is used as the value of the
2120``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2121
2122A DWARF procedure is defined for each well nested structured control flow region
2123which provides the conceptual lane program location for a lane if it is not
2124active (namely it is divergent). The DWARF operation expression for each region
2125conceptually inherits the value of the immediately enclosing region and modifies
2126it according to the semantics of the region.
2127
2128For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2129the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2130region the divergent program location is at the end of the ``IF/THEN/ELSE``
2131region since the ``THEN`` region has completed.
2132
2133The lane PC artificial variable is assigned at each region transition. It uses
2134the immediately enclosing region's DWARF procedure to compute the program
2135location for each lane assuming they are divergent, and then modifies the result
2136by inserting the current program location for each lane that the ``EXEC`` mask
2137indicates is active.
2138
2139By having separate DWARF procedures for each region, they can be reused to
2140define the value for any nested region. This reduces the total size of the DWARF
2141operation expressions.
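
The per-lane selection that these DWARF procedures perform can be modelled in
Python. This is a conceptual sketch, not DWARF evaluation: the function names
are hypothetical, a vector of lane program locations is modelled as a list, and
``None`` stands for the undefined location.

```python
def select_per_lane(zero_values, one_values, mask):
    """For each lane take the entry from `one_values` if the lane's mask bit
    is set, otherwise the entry from `zero_values`."""
    return [one if (mask >> lane) & 1 else zero
            for lane, (zero, one) in enumerate(zip(zero_values, one_values))]


def divergent_lane_pcs(enclosing, region_pc, saved_exec):
    """Divergent lane PCs of a region: lanes that were active when the region
    was entered are positioned at `region_pc`; the others inherit the value
    of the immediately enclosing region."""
    return select_per_lane(enclosing, [region_pc] * len(enclosing), saved_exec)


def lane_pcs(divergent, current_pc, exec_mask):
    """Lane PC vector: currently active lanes are at `current_pc`; the others
    keep their divergent program location."""
    return select_per_lane(divergent, [current_pc] * len(divergent), exec_mask)


# Four lanes; lane 3 was not active on entry to the subprogram.
outer = [None] * 4
then_1 = divergent_lane_pcs(outer, region_pc=0x100, saved_exec=0b0111)
current = lane_pcs(then_1, current_pc=0x108, exec_mask=0b0011)
```

In this example lanes 0 and 1 are at the current program location, lane 2 is
divergent and positioned at the start of the region, and lane 3 remains
undefined.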
2142
2143The following provides an example using pseudo LLVM MIR.
2144
2145.. code::
2146  :number-lines:
2147
2148  $lex_start:
2149    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2150      DW_AT_name = "__uint64";
2151      DW_AT_byte_size = 8;
2152      DW_AT_encoding = DW_ATE_unsigned;
2153    ];
2154    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2155      DW_AT_name = "__active_lane_pc";
2156      DW_AT_location = [
2157        DW_OP_regx PC;
2158        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
2160        DW_OP_LLVM_select_bit_piece 64, 64;
2161      ];
2162    ];
2163    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2164      DW_AT_name = "__divergent_lane_pc";
2165      DW_AT_location = [
2166        DW_OP_LLVM_undefined;
2167        DW_OP_LLVM_extend 64, 64;
2168      ];
2169    ];
2170    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2171      DW_OP_call_ref %__divergent_lane_pc;
2172      DW_OP_call_ref %__active_lane_pc;
2173    ];
2174    a;
2175    %1 = EXEC;
2176    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2177    %2 = c1;
2178  $lex_1_start:
2179    EXEC = %1 & %2;
2180  $lex_1_then:
2181      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2182        DW_AT_name = "__divergent_lane_pc_1_then";
2183        DW_AT_location = DIExpression[
2184          DW_OP_call_ref %__divergent_lane_pc;
2185          DW_OP_addrx &lex_1_start;
2186          DW_OP_stack_value;
2187          DW_OP_LLVM_extend 64, 64;
2188          DW_OP_call_ref %__lex_1_save_exec;
2189          DW_OP_deref_type 64, %__uint_64;
2190          DW_OP_LLVM_select_bit_piece 64, 64;
2191        ];
2192      ];
2193      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2194        DW_OP_call_ref %__divergent_lane_pc_1_then;
2195        DW_OP_call_ref %__active_lane_pc;
2196      ];
2197      b;
2198      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2200      %4 = c2;
2201  $lex_1_1_start:
2202      EXEC = %3 & %4;
2203  $lex_1_1_then:
2204        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2205          DW_AT_name = "__divergent_lane_pc_1_1_then";
2206          DW_AT_location = DIExpression[
2207            DW_OP_call_ref %__divergent_lane_pc_1_then;
2208            DW_OP_addrx &lex_1_1_start;
2209            DW_OP_stack_value;
2210            DW_OP_LLVM_extend 64, 64;
2211            DW_OP_call_ref %__lex_1_1_save_exec;
2212            DW_OP_deref_type 64, %__uint_64;
2213            DW_OP_LLVM_select_bit_piece 64, 64;
2214          ];
2215        ];
2216        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2217          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2218          DW_OP_call_ref %__active_lane_pc;
2219        ];
2220        c;
2221      EXEC = ~EXEC & %3;
2222  $lex_1_1_else:
2223        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2224          DW_AT_name = "__divergent_lane_pc_1_1_else";
2225          DW_AT_location = DIExpression[
2226            DW_OP_call_ref %__divergent_lane_pc_1_then;
2227            DW_OP_addrx &lex_1_1_end;
2228            DW_OP_stack_value;
2229            DW_OP_LLVM_extend 64, 64;
2230            DW_OP_call_ref %__lex_1_1_save_exec;
2231            DW_OP_deref_type 64, %__uint_64;
2232            DW_OP_LLVM_select_bit_piece 64, 64;
2233          ];
2234        ];
2235        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2236          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2237          DW_OP_call_ref %__active_lane_pc;
2238        ];
2239        d;
2240      EXEC = %3;
2241  $lex_1_1_end:
2242      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2243        DW_OP_call_ref %__divergent_lane_pc;
2244        DW_OP_call_ref %__active_lane_pc;
2245      ];
2246      e;
2247    EXEC = ~EXEC & %1;
2248  $lex_1_else:
2249      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2250        DW_AT_name = "__divergent_lane_pc_1_else";
2251        DW_AT_location = DIExpression[
2252          DW_OP_call_ref %__divergent_lane_pc;
2253          DW_OP_addrx &lex_1_end;
2254          DW_OP_stack_value;
2255          DW_OP_LLVM_extend 64, 64;
2256          DW_OP_call_ref %__lex_1_save_exec;
2257          DW_OP_deref_type 64, %__uint_64;
2258          DW_OP_LLVM_select_bit_piece 64, 64;
2259        ];
2260      ];
2261      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2262        DW_OP_call_ref %__divergent_lane_pc_1_else;
2263        DW_OP_call_ref %__active_lane_pc;
2264      ];
2265      f;
2266    EXEC = %1;
2267  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2269      DW_OP_call_ref %__divergent_lane_pc;
2270      DW_OP_call_ref %__active_lane_pc;
2271    ];
2272    g;
2273  $lex_end:
2274
The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
of the active lanes with the current program location.
2277
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
2283
2284The DWARF procedures for each region use the values of the saved execution mask
2285artificial variables to only update the lanes that are active on entry to the
2286region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
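
The per-lane selection performed by these DWARF procedures can be modeled in a
few lines of Python. This is only an illustrative sketch: ``select_lanes``, its
arguments, and the 4-lane wavefront are hypothetical, and a plain list of
values is a simplification of the composite location description built by
``DW_OP_LLVM_extend`` and ``DW_OP_LLVM_select_bit_piece``.

.. code-block:: python

  UNDEFINED = None  # Stands in for the undefined location description.

  def select_lanes(enclosing, region_pc, saved_exec, lanes=64):
      """Return the per-lane pc vector for a region.

      Lanes whose bit is set in saved_exec (active on entry to the region)
      get region_pc. All other lanes keep the value they had in the
      enclosing region.
      """
      return [
          region_pc if (saved_exec >> lane) & 1 else enclosing[lane]
          for lane in range(lanes)
      ]

  # Hypothetical 4-lane wavefront with lane 3 inactive on subprogram entry.
  entry = [UNDEFINED] * 4
  # Lanes 0-2 were active on entry to the outer region.
  outer = select_lanes(entry, 0x100, 0b0111, lanes=4)
  # Only lanes 0 and 1 were active on entry to the nested region.
  inner = select_lanes(outer, 0x140, 0b0011, lanes=4)

In this model lane 2 keeps the outer region value in ``inner``, and lane 3
stays undefined throughout because it was never active.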
2289
2290Other structured control flow regions can be handled similarly. For example,
2291loops would set the divergent program location for the region at the end of the
2292loop. Any lanes active will be in the loop, and any lanes not active must have
2293exited the loop.
2294
2295An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2296``IF/THEN/ELSE`` regions.
2297
2298The DWARF procedures can use the active lane artificial variable described in
2299:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2300``EXEC`` mask in order to support whole or quad wavefront mode.
2301
2302.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2303
2304``DW_AT_LLVM_active_lane``
2305~~~~~~~~~~~~~~~~~~~~~~~~~~
2306
2307The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2308entry is used to specify the lanes that are conceptually active for a SIMT
2309thread.
2310
2311The execution mask may be modified to implement whole or quad wavefront mode
2312operations. For example, all lanes may need to temporarily be made active to
2313execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2314update it to enable the necessary lanes, perform the operations, and then
2315restore the ``EXEC`` mask from the saved value. While executing the whole
2316wavefront region, the conceptual execution mask is the saved value, not the
2317``EXEC`` value.
2318
2319This is handled by defining an artificial variable for the active lane mask. The
2320active lane mask artificial variable would be the actual ``EXEC`` mask for
2321normal regions, and the saved execution mask for regions where the mask is
2322temporarily updated. The location list expression created for this artificial
2323variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2324attribute.
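
As a rough model of the rule above, the conceptual active lane mask can be
expressed as a choice between two values. The function name and the use of
``None`` to mean "not inside a whole wavefront region" are assumptions made for
this sketch only.

.. code-block:: python

  def conceptual_active_lanes(exec_mask, saved_exec=None):
      """Return the mask that DW_AT_LLVM_active_lane should evaluate to.

      saved_exec is None in normal regions. In a region that temporarily
      rewrites EXEC it holds the mask saved on entry to that region.
      """
      return exec_mask if saved_exec is None else saved_exec

  # Normal region: the hardware EXEC mask is the conceptual mask.
  normal = conceptual_active_lanes(0b1111)
  # Whole wavefront region: EXEC is all ones, but only lanes 0 and 2 are
  # conceptually active.
  whole_wave = conceptual_active_lanes(0xFFFFFFFFFFFFFFFF, saved_exec=0b0101)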
2325
2326``DW_AT_LLVM_augmentation``
2327~~~~~~~~~~~~~~~~~~~~~~~~~~~
2328
2329For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2330debugger information entry has the following value for the augmentation string:
2331
2332::
2333
2334  [amdgpu:v0.0]
2335
2336The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2337extensions used in the DWARF of the compilation unit. The version number
2338conforms to [SEMVER]_.
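
A consumer might extract the version from such an augmentation string as
sketched below. The function name and the regular expression are assumptions
for illustration. The sketch returns a list because an augmentation string
could, in principle, carry more than one bracketed vendor entry.

.. code-block:: python

  import re

  # Matches one "[vendor:vX.Y]" entry of an augmentation string.
  _ENTRY = re.compile(r"\[([^:\]]+):v(\d+)\.(\d+)\]")

  def parse_augmentation(text):
      """Return a list of (vendor, major, minor) tuples found in text."""
      return [
          (vendor, int(major), int(minor))
          for vendor, major, minor in _ENTRY.findall(text)
      ]

  entries = parse_augmentation("[amdgpu:v0.0]")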
2339
2340Call Frame Information
2341----------------------
2342
2343DWARF Call Frame Information (CFI) describes how a consumer can virtually
2344*unwind* call frames in a running process or core dump. See DWARF Version 5
2345section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2346
2347For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2348
23491.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2350
2351    ::
2352
2353      [amd:v0.0]
2354
    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or in the FDEs that use it. The version number
    conforms to [SEMVER]_.
2358
23592.  ``address_size`` for the ``Global`` address space is defined in
2360    :ref:`amdgpu-dwarf-address-space-identifier`.
2361
23623.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2363
23644.  ``code_alignment_factor`` is 4 bytes.
2365
2366    .. TODO::
2367
2368       Add to :ref:`amdgpu-processor-table` table.
2369
23705.  ``data_alignment_factor`` is 4 bytes.
2371
2372    .. TODO::
2373
2374       Add to :ref:`amdgpu-processor-table` table.
2375
23766.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2377    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2378
7.  ``initial_instructions``: Since a subprogram X with fewer registers can be
2380    called from subprogram Y that has more allocated, X will not change any of
2381    the extra registers as it cannot access them. Therefore, the default rule
2382    for all columns is ``same value``.
2383
2384For AMDGPU the register number follows the numbering defined in
2385:ref:`amdgpu-dwarf-register-identifier`.
2386
2387For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2388the return address to get the address of a byte within the call site
2389instructions. See DWARF Version 5 section 6.4.4.
2390
2391Accelerated Access
2392------------------
2393
2394See DWARF Version 5 section 6.1.
2395
2396Lookup By Name Section Header
2397~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2398
2399See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2400
For AMDGPU the lookup by name section header table fields have the following
values:
2402
2403``augmentation_string_size`` (uword)
2404
2405  Set to the length of the ``augmentation_string`` value which is always a
2406  multiple of 4.
2407
2408``augmentation_string`` (sequence of UTF-8 characters)
2409
2410  Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2411
2412  ::
2413
2414    [amdgpu:v0.0]
2415
2416  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2417  extensions used in the DWARF of this index. The version number conforms to
2418  [SEMVER]_.
2419
2420  .. note::
2421
2422    This is different to the DWARF Version 5 definition that requires the first
2423    4 characters to be the vendor ID. But this is consistent with the other
2424    augmentation strings and does allow multiple vendor contributions. However,
2425    backwards compatibility may be more desirable.
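
The relationship between ``augmentation_string`` and
``augmentation_string_size`` can be sketched as follows. The helper name is
hypothetical; the sketch only shows null padding to a multiple of 4 bytes.

.. code-block:: python

  def encode_augmentation_string(text):
      """Return (augmentation_string_size, augmentation_string bytes)."""
      data = text.encode("utf-8")
      # Pad with null bytes up to the next multiple of 4.
      data += b"\x00" * (-len(data) % 4)
      return len(data), data

  size, payload = encode_augmentation_string("[amdgpu:v0.0]")

The 13 byte string is padded with three null bytes, so ``size`` is 16.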
2426
2427Lookup By Address Section Header
2428~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2429
2430See DWARF Version 5 section 6.1.2.
2431
For AMDGPU the lookup by address section header table fields have the following
values:
2433
2434``address_size`` (ubyte)
2435
2436  Match the address size for the ``Global`` address space defined in
2437  :ref:`amdgpu-dwarf-address-space-identifier`.
2438
2439``segment_selector_size`` (ubyte)
2440
2441  AMDGPU does not use a segment selector so this is 0. The entries in the
2442  ``.debug_aranges`` do not have a segment selector.
2443
2444Line Number Information
2445-----------------------
2446
2447See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2448
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2450The instruction set must be obtained from the ELF file header ``e_flags`` field
2451in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2452<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
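
A minimal sketch of how a consumer could read the instruction set from the ELF
header is shown below. It assumes a little-endian 64-bit ELF header and that
the ``EF_AMDGPU_MACH`` mask is ``0x0ff``; check both against the ELF header
tables before relying on them. The function name and the sample ``e_flags``
value are hypothetical.

.. code-block:: python

  import struct

  EF_AMDGPU_MACH = 0x0FF  # Assumed mask of the machine field in e_flags.
  E_FLAGS_OFFSET = 48     # Offset of e_flags in a 64-bit ELF header.

  def amdgpu_mach(elf_header):
      """Return the EF_AMDGPU_MACH field of a little-endian ELF64 header."""
      (e_flags,) = struct.unpack_from("<I", elf_header, E_FLAGS_OFFSET)
      return e_flags & EF_AMDGPU_MACH

  # Hypothetical header: 64 zero bytes with only e_flags filled in.
  header = bytearray(64)
  struct.pack_into("<I", header, E_FLAGS_OFFSET, 0x12C)
  mach = amdgpu_mach(header)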
2453
2454.. TODO::
2455
2456  Should the ``isa`` state machine register be used to indicate if the code is
2457  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2458
2459For AMDGPU the line number program header fields have the following values (see
2460DWARF Version 5 section 6.2.4):
2461
2462``address_size`` (ubyte)
2463  Matches the address size for the ``Global`` address space defined in
2464  :ref:`amdgpu-dwarf-address-space-identifier`.
2465
2466``segment_selector_size`` (ubyte)
2467  AMDGPU does not use a segment selector so this is 0.
2468
2469``minimum_instruction_length`` (ubyte)
2470  For GFX9-GFX10 this is 4.
2471
2472``maximum_operations_per_instruction`` (ubyte)
2473  For GFX9-GFX10 this is 1.
2474
2475Source text for online-compiled programs (for example, those compiled by the
2476OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2477See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2478Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2479<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2480
2481The Clang option used to control source embedding in AMDGPU is defined in
2482:ref:`amdgpu-clang-debug-options-table`.
2483
2484  .. table:: AMDGPU Clang Debug Options
2485     :name: amdgpu-clang-debug-options-table
2486
2487     ==================== ==================================================
2488     Debug Flag           Description
2489     ==================== ==================================================
2490     -g[no-]embed-source  Enable/disable embedding source text in DWARF
2491                          debug sections. Useful for environments where
2492                          source cannot be written to disk, such as
2493                          when performing online compilation.
2494     ==================== ==================================================
2495
2496For example:
2497
2498``-gembed-source``
2499  Enable the embedded source.
2500
2501``-gno-embed-source``
2502  Disable the embedded source.
2503
250432-Bit and 64-Bit DWARF Formats
2505-------------------------------
2506
2507See DWARF Version 5 section 7.4 and
2508:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2509
2510For AMDGPU:
2511
2512* For the ``amdgcn`` target architecture only the 64-bit process address space
2513  is supported.
2514
2515* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2516  the 32-bit DWARF format.
2517
2518Unit Headers
2519------------
2520
2521For AMDGPU the following values apply for each of the unit headers described in
2522DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2523
2524``address_size`` (ubyte)
2525  Matches the address size for the ``Global`` address space defined in
2526  :ref:`amdgpu-dwarf-address-space-identifier`.
2527
2528.. _amdgpu-code-conventions:
2529
2530Code Conventions
2531================
2532
2533This section provides code conventions used for each supported target triple OS
2534(see :ref:`amdgpu-target-triples`).
2535
2536AMDHSA
2537------
2538
2539This section provides code conventions used when the target triple OS is
2540``amdhsa`` (see :ref:`amdgpu-target-triples`).
2541
2542.. _amdgpu-amdhsa-code-object-metadata:
2543
2544Code Object Metadata
2545~~~~~~~~~~~~~~~~~~~~
2546
2547The code object metadata specifies extensible metadata associated with the code
2548objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
2550:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2551:ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
2552:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
2553
2554Code object metadata is specified in a note record (see
2555:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries, such
as the segment sizes needed in a dispatch packet. In addition, a high-level
language runtime may require other information to be included. For example, the
AMD OpenCL runtime records kernel argument information.
2561
2562.. _amdgpu-amdhsa-code-object-metadata-v2:
2563
2564Code Object V2 Metadata
2565+++++++++++++++++++++++
2566
2567.. warning::
2568  Code object V2 is not the default code object version emitted by this version
2569  of LLVM.
2570
2571Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2572(see :ref:`amdgpu-note-records-v2`).
2573
2574The metadata is specified as a YAML formatted string (see [YAML]_ and
2575:doc:`YamlIO`).
2576
2577.. TODO::
2578
  Is the string null terminated? It probably should not be if YAML allows it to
  contain null characters; otherwise it should be.
2581
The metadata is represented as a single YAML document consisting of the mapping
2583defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2584referenced tables.
2585
2586For boolean values, the string values of ``false`` and ``true`` are used for
2587false and true respectively.
2588
2589Additional information can be added to the mappings. To avoid conflicts, any
2590non-AMD key names should be prefixed by "*vendor-name*.".
2591
2592  .. table:: AMDHSA Code Object V2 Metadata Map
2593     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2594
2595     ========== ============== ========= =======================================
2596     String Key Value Type     Required? Description
2597     ========== ============== ========= =======================================
2598     "Version"  sequence of    Required  - The first integer is the major
2599                2 integers                 version. Currently 1.
2600                                         - The second integer is the minor
2601                                           version. Currently 0.
2602     "Printf"   sequence of              Each string is encoded information
2603                strings                  about a printf function call. The
2604                                         encoded information is organized as
2605                                         fields separated by colon (':'):
2606
2607                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2608
2609                                         where:
2610
2611                                         ``ID``
2612                                           A 32-bit integer as a unique id for
2613                                           each printf function call
2614
2615                                         ``N``
2616                                           A 32-bit integer equal to the number
2617                                           of arguments of printf function call
2618                                           minus 1
2619
2620                                         ``S[i]`` (where i = 0, 1, ... , N-1)
2621                                           32-bit integers for the size in bytes
2622                                           of the i-th FormatString argument of
2623                                           the printf function call
2624
                                         ``FormatString``
2626                                           The format string passed to the
2627                                           printf function call.
2628     "Kernels"  sequence of    Required  Sequence of the mappings for each
2629                mapping                  kernel in the code object. See
2630                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2631                                         for the definition of the mapping.
2632     ========== ============== ========= =======================================
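
The ``"Printf"`` string encoding described in the table can be decoded as
sketched below. The function name is hypothetical. The sketch splits off
exactly ``N`` size fields so that any colons inside the format string are
preserved.

.. code-block:: python

  def parse_printf_entry(entry):
      """Split an encoded "Printf" string into (id, sizes, format_string)."""
      printf_id, count, rest = entry.split(":", 2)
      n = int(count)
      # Split at most N times so colons in the format string survive.
      *sizes, format_string = rest.split(":", n)
      if len(sizes) != n:
          raise ValueError("expected %d argument sizes" % n)
      return int(printf_id), [int(size) for size in sizes], format_string

  # Hypothetical entry with two arguments of 4 and 8 bytes.
  parsed = parse_printf_entry("1:2:4:8:x=%d y=%ld: done")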
2633
2634..
2635
2636  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2637     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2638
2639     ================= ============== ========= ================================
2640     String Key        Value Type     Required? Description
2641     ================= ============== ========= ================================
2642     "Name"            string         Required  Source name of the kernel.
2643     "SymbolName"      string         Required  Name of the kernel
2644                                                descriptor ELF symbol.
2645     "Language"        string                   Source language of the kernel.
2646                                                Values include:
2647
2648                                                - "OpenCL C"
2649                                                - "OpenCL C++"
2650                                                - "HCC"
2651                                                - "OpenMP"
2652
2653     "LanguageVersion" sequence of              - The first integer is the major
2654                       2 integers                 version.
2655                                                - The second integer is the
2656                                                  minor version.
2657     "Attrs"           mapping                  Mapping of kernel attributes.
2658                                                See
2659                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2660                                                for the mapping definition.
2661     "Args"            sequence of              Sequence of mappings of the
2662                       mapping                  kernel arguments. See
2663                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2664                                                for the definition of the mapping.
2665     "CodeProps"       mapping                  Mapping of properties related to
2666                                                the kernel code. See
2667                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2668                                                for the mapping definition.
2669     ================= ============== ========= ================================
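
The keys marked as required in the two tables above could be checked as
sketched below. The sketch assumes the YAML document has already been parsed
into Python dictionaries and lists; the function name, the sample kernel names,
and the sample symbol name are hypothetical.

.. code-block:: python

  REQUIRED_TOP_LEVEL = ("Version", "Kernels")
  REQUIRED_PER_KERNEL = ("Name", "SymbolName")

  def missing_required_keys(metadata):
      """Return a list describing each required key that is absent."""
      missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
      for index, kernel in enumerate(metadata.get("Kernels", [])):
          missing += [
              "Kernels[%d].%s" % (index, key)
              for key in REQUIRED_PER_KERNEL
              if key not in kernel
          ]
      return missing

  # Hypothetical metadata as it might look after YAML parsing.
  example = {
      "Version": [1, 0],
      "Kernels": [
          {"Name": "foo", "SymbolName": "foo@kd", "Language": "OpenCL C"},
          {"Name": "bar"},
      ],
  }
  problems = missing_required_keys(example)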
2670
2671..
2672
2673  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2674     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2675
2676     =================== ============== ========= ==============================
2677     String Key          Value Type     Required? Description
2678     =================== ============== ========= ==============================
2679     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2680                         3 integers               must be >=1 and the dispatch
2681                                                  work-group size X, Y, Z must
2682                                                  correspond to the specified
2683                                                  values. Defaults to 0, 0, 0.
2684
2685                                                  Corresponds to the OpenCL
2686                                                  ``reqd_work_group_size``
2687                                                  attribute.
2688     "WorkGroupSizeHint" sequence of              The dispatch work-group size
2689                         3 integers               X, Y, Z is likely to be the
2690                                                  specified values.
2691
2692                                                  Corresponds to the OpenCL
2693                                                  ``work_group_size_hint``
2694                                                  attribute.
2695     "VecTypeHint"       string                   The name of a scalar or vector
2696                                                  type.
2697
2698                                                  Corresponds to the OpenCL
2699                                                  ``vec_type_hint`` attribute.
2700
2701     "RuntimeHandle"     string                   The external symbol name
2702                                                  associated with a kernel.
2703                                                  OpenCL runtime allocates a
2704                                                  global buffer for the symbol
2705                                                  and saves the kernel's address
2706                                                  to it, which is used for
2707                                                  device side enqueueing. Only
2708                                                  available for device side
2709                                                  enqueued kernels.
2710     =================== ============== ========= ==============================
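
The ``"ReqdWorkGroupSize"`` rule can be sketched as a small check. The function
name is hypothetical, and how a real runtime reports a mismatch is not
specified here.

.. code-block:: python

  def check_work_group_size(reqd, dispatch):
      """Return True if the dispatch size (X, Y, Z) satisfies reqd."""
      reqd = tuple(reqd)
      if reqd == (0, 0, 0):
          return True  # Default value: no required work-group size.
      if any(value < 1 for value in reqd):
          raise ValueError("ReqdWorkGroupSize values must all be >= 1")
      return tuple(dispatch) == reqd

  ok = check_work_group_size([64, 1, 1], (64, 1, 1))
  mismatch = check_work_group_size([64, 1, 1], (256, 1, 1))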
2711
2712..
2713
2714  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2715     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2716
2717     ================= ============== ========= ================================
2718     String Key        Value Type     Required? Description
2719     ================= ============== ========= ================================
2720     "Name"            string                   Kernel argument name.
2721     "TypeName"        string                   Kernel argument type name.
2722     "Size"            integer        Required  Kernel argument size in bytes.
2723     "Align"           integer        Required  Kernel argument alignment in
2724                                                bytes. Must be a power of two.
2725     "ValueKind"       string         Required  Kernel argument kind that
2726                                                specifies how to set up the
2727                                                corresponding argument.
2728                                                Values include:
2729
2730                                                "ByValue"
2731                                                  The argument is copied
2732                                                  directly into the kernarg.
2733
2734                                                "GlobalBuffer"
2735                                                  A global address space pointer
2736                                                  to the buffer data is passed
2737                                                  in the kernarg.
2738
2739                                                "DynamicSharedPointer"
2740                                                  A group address space pointer
2741                                                  to dynamically allocated LDS
2742                                                  is passed in the kernarg.
2743
2744                                                "Sampler"
2745                                                  A global address space
2746                                                  pointer to a S# is passed in
2747                                                  the kernarg.
2748
2749                                                "Image"
2750                                                  A global address space
2751                                                  pointer to a T# is passed in
2752                                                  the kernarg.
2753
2754                                                "Pipe"
2755                                                  A global address space pointer
2756                                                  to an OpenCL pipe is passed in
2757                                                  the kernarg.
2758
2759                                                "Queue"
2760                                                  A global address space pointer
2761                                                  to an OpenCL device enqueue
2762                                                  queue is passed in the
2763                                                  kernarg.
2764
2765                                                "HiddenGlobalOffsetX"
2766                                                  The OpenCL grid dispatch
2767                                                  global offset for the X
2768                                                  dimension is passed in the
2769                                                  kernarg.
2770
2771                                                "HiddenGlobalOffsetY"
2772                                                  The OpenCL grid dispatch
2773                                                  global offset for the Y
2774                                                  dimension is passed in the
2775                                                  kernarg.
2776
2777                                                "HiddenGlobalOffsetZ"
2778                                                  The OpenCL grid dispatch
2779                                                  global offset for the Z
2780                                                  dimension is passed in the
2781                                                  kernarg.
2782
2783                                                "HiddenNone"
2784                                                  An argument that is not used
2785                                                  by the kernel. Space needs to
2786                                                  be left for it, but it does
2787                                                  not need to be set up.
2788
2789                                                "HiddenPrintfBuffer"
2790                                                  A global address space pointer
2791                                                  to the runtime printf buffer
                                                  is passed in the kernarg.
2793
2794                                                "HiddenHostcallBuffer"
2795                                                  A global address space pointer
2796                                                  to the runtime hostcall buffer
                                                  is passed in the kernarg.
2798
2799                                                "HiddenDefaultQueue"
2800                                                  A global address space pointer
2801                                                  to the OpenCL device enqueue
2802                                                  queue that should be used by
2803                                                  the kernel by default is
2804                                                  passed in the kernarg.
2805
2806                                                "HiddenCompletionAction"
2807                                                  A global address space pointer
2808                                                  to help link enqueued kernels into
2809                                                  the ancestor tree for determining
2810                                                  when the parent kernel has finished.
2811
2812                                                "HiddenMultiGridSyncArg"
2813                                                  A global address space pointer for
2814                                                  multi-grid synchronization is
2815                                                  passed in the kernarg.
2816
2817     "ValueType"       string                   Unused and deprecated. This should no longer
2818                                                be emitted, but is accepted for compatibility.
2819
2820
2821     "PointeeAlign"    integer                  Alignment in bytes of pointee
2822                                                type for pointer type kernel
2823                                                argument. Must be a power
2824                                                of 2. Only present if
2825                                                "ValueKind" is
2826                                                "DynamicSharedPointer".
2827     "AddrSpaceQual"   string                   Kernel argument address space
2828                                                qualifier. Only present if
2829                                                "ValueKind" is "GlobalBuffer" or
2830                                                "DynamicSharedPointer". Values
2831                                                are:
2832
2833                                                - "Private"
2834                                                - "Global"
2835                                                - "Constant"
2836                                                - "Local"
2837                                                - "Generic"
2838                                                - "Region"
2839
2840                                                .. TODO::
2841
2842                                                   Is GlobalBuffer only Global
2843                                                   or Constant? Is
2844                                                   DynamicSharedPointer always
2845                                                   Local? Can HCC allow Generic?
2846                                                   How can Private or Region
2847                                                   ever happen?
2848
2849     "AccQual"         string                   Kernel argument access
2850                                                qualifier. Only present if
2851                                                "ValueKind" is "Image" or
2852                                                "Pipe". Values
2853                                                are:
2854
2855                                                - "ReadOnly"
2856                                                - "WriteOnly"
2857                                                - "ReadWrite"
2858
2859                                                .. TODO::
2860
2861                                                   Does this apply to
2862                                                   GlobalBuffer?
2863
2864     "ActualAccQual"   string                   The actual memory accesses
2865                                                performed by the kernel on the
2866                                                kernel argument. Only present if
2867                                                "ValueKind" is "GlobalBuffer",
2868                                                "Image", or "Pipe". This may be
2869                                                more restrictive than indicated
2870                                                by "AccQual" to reflect what the
2871                                                kernel actually does. If not
2872                                                present then the runtime must
2873                                                assume what is implied by
2874                                                "AccQual" and "IsConst". Values
2875                                                are:
2876
2877                                                - "ReadOnly"
2878                                                - "WriteOnly"
2879                                                - "ReadWrite"
2880
2881     "IsConst"         boolean                  Indicates if the kernel argument
2882                                                is const qualified. Only present
2883                                                if "ValueKind" is
2884                                                "GlobalBuffer".
2885
2886     "IsRestrict"      boolean                  Indicates if the kernel argument
2887                                                is restrict qualified. Only
2888                                                present if "ValueKind" is
2889                                                "GlobalBuffer".
2890
2891     "IsVolatile"      boolean                  Indicates if the kernel argument
2892                                                is volatile qualified. Only
2893                                                present if "ValueKind" is
2894                                                "GlobalBuffer".
2895
2896     "IsPipe"          boolean                  Indicates if the kernel argument
2897                                                is pipe qualified. Only present
2898                                                if "ValueKind" is "Pipe".
2899
2900                                                .. TODO::
2901
2902                                                   Can GlobalBuffer be pipe
2903                                                   qualified?
2904
2905     ================= ============== ========= ================================
2906
2907..
2908
2909  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2910     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2911
2912     ============================ ============== ========= =====================
2913     String Key                   Value Type     Required? Description
2914     ============================ ============== ========= =====================
2915     "KernargSegmentSize"         integer        Required  The size in bytes of
2916                                                           the kernarg segment
2917                                                           that holds the values
2918                                                           of the arguments to
2919                                                           the kernel.
2920     "GroupSegmentFixedSize"      integer        Required  The amount of group
2921                                                           segment memory
2922                                                           required by a
2923                                                           work-group in
2924                                                           bytes. This does not
2925                                                           include any
2926                                                           dynamically allocated
2927                                                           group segment memory
2928                                                           that may be added
2929                                                           when the kernel is
2930                                                           dispatched.
2931     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
2932                                                           private address space
2933                                                           memory required for a
2934                                                           work-item in
2935                                                           bytes. If the kernel
2936                                                           uses a dynamic call
2937                                                           stack then additional
2938                                                           space must be added
2939                                                           to this value for the
2940                                                           call stack.
2941     "KernargSegmentAlign"        integer        Required  The maximum byte
2942                                                           alignment of
2943                                                           arguments in the
2944                                                           kernarg segment. Must
2945                                                           be a power of 2.
2946     "WavefrontSize"              integer        Required  Wavefront size. Must
2947                                                           be a power of 2.
2948     "NumSGPRs"                   integer        Required  Number of scalar
2949                                                           registers used by a
2950                                                           wavefront for
2951                                                           GFX6-GFX10. This
2952                                                           includes the special
2953                                                           SGPRs for VCC, Flat
2954                                                           Scratch (GFX7-GFX10)
2955                                                           and XNACK (for
2956                                                           GFX8-GFX10). It does
2957                                                           not include the 16
2958                                                           SGPRs added if a trap
2959                                                           handler is
2960                                                           enabled. It is not
2961                                                           rounded up to the
2962                                                           allocation
2963                                                           granularity.
2964     "NumVGPRs"                   integer        Required  Number of vector
2965                                                           registers used by
2966                                                           each work-item for
2967                                                           GFX6-GFX10.
2968     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
2969                                                           work-group size
2970                                                           supported by the
2971                                                           kernel in work-items.
2972                                                           Must be >=1 and
2973                                                           consistent with
2974                                                           ReqdWorkGroupSize if
2975                                                           not 0, 0, 0.
2976     "NumSpilledSGPRs"            integer                  Number of stores from
2977                                                           a scalar register to
2978                                                           a register allocator
2979                                                           created spill
2980                                                           location.
2981     "NumSpilledVGPRs"            integer                  Number of stores from
2982                                                           a vector register to
2983                                                           a register allocator
2984                                                           created spill
2985                                                           location.
2986     ============================ ============== ========= =====================
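
The constraints stated in the table above can be checked mechanically. The
following is a minimal sketch, assuming the code properties map has already
been decoded into a Python ``dict``; the helper names ``is_power_of_two`` and
``check_code_props`` are hypothetical and not part of any LLVM or runtime API.

.. code-block:: python

   # Keys marked "Required" in the V2 kernel code properties table.
   REQUIRED_KEYS = (
       "KernargSegmentSize", "GroupSegmentFixedSize",
       "PrivateSegmentFixedSize", "KernargSegmentAlign", "WavefrontSize",
       "NumSGPRs", "NumVGPRs", "MaxFlatWorkGroupSize",
   )
   # Keys whose description says "Must be a power of 2".
   POWER_OF_TWO_KEYS = ("KernargSegmentAlign", "WavefrontSize")


   def is_power_of_two(value):
       """Return True if value is a positive power of 2."""
       return value > 0 and (value & (value - 1)) == 0


   def check_code_props(props):
       """Return a list of problems found in a decoded code properties map."""
       problems = [f"missing required key {key}"
                   for key in REQUIRED_KEYS if key not in props]
       for key in POWER_OF_TWO_KEYS:
           if key in props and not is_power_of_two(props[key]):
               problems.append(f"{key} must be a power of 2")
       if props.get("MaxFlatWorkGroupSize", 1) < 1:
           problems.append("MaxFlatWorkGroupSize must be >= 1")
       return problems

An empty list means that none of the checks in this sketch failed. It does not
establish that the map is valid in any wider sense.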
2987
2988.. _amdgpu-amdhsa-code-object-metadata-v3:
2989
2990Code Object V3 Metadata
2991+++++++++++++++++++++++
2992
2993Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
2994record (see :ref:`amdgpu-note-records-v3-v4`).
2995
2996The metadata is represented as Message Pack formatted binary data (see
2997[MsgPack]_). The top level is a Message Pack map that includes the
2998keys defined in table
2999:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3000tables.
3001
3002Additional information can be added to the maps. To avoid conflicts,
3003any key names should be prefixed by "*vendor-name*." where
3004*vendor-name* can be the name of the vendor and the specific vendor
3005tool that generates the information. The prefix is abbreviated to
3006simply "." when it appears within a map that has been added by the
3007same *vendor-name*.
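
The abbreviation rule can be illustrated with a short sketch. It assumes that
the ``amdhsa`` prefix used by the keys in the tables of this section plays the
role of *vendor-name*; the helper ``expand_key`` is hypothetical and only
shows one way a consumer might recover the fully prefixed key.

.. code-block:: python

   def expand_key(key, vendor_name):
       """Expand an abbreviated "." key to its fully prefixed form."""
       # A key starting with "." inherits the prefix of the enclosing map.
       if key.startswith("."):
           return vendor_name + key
       # Any other key is taken to be fully prefixed already.
       return key


   print(expand_key(".name", "amdhsa"))           # amdhsa.name
   print(expand_key("amdhsa.kernels", "amdhsa"))  # amdhsa.kernels

Under this reading, a key such as ".name" inside a kernel map is shorthand for
a key that carries the same prefix as the map that contains it.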
3008
3009  .. table:: AMDHSA Code Object V3 Metadata Map
3010     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3011
3012     ================= ============== ========= =======================================
3013     String Key        Value Type     Required? Description
3014     ================= ============== ========= =======================================
3015     "amdhsa.version"  sequence of    Required  - The first integer is the major
3016                       2 integers                 version. Currently 1.
3017                                                - The second integer is the minor
3018                                                  version. Currently 0.
3019     "amdhsa.printf"   sequence of              Each string is encoded information
3020                       strings                  about a printf function call. The
3021                                                encoded information is organized as
3022                                                fields separated by colon (':'):
3023
3024                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3025
3026                                                where:
3027
3028                                                ``ID``
3029                                                  A 32-bit integer as a unique id for
3030                                                  each printf function call
3031
3032                                                ``N``
3033                                                  A 32-bit integer equal to the number
3034                                                  of arguments of printf function call
3035                                                  minus 1
3036
3037                                                ``S[i]`` (where i = 0, 1, ... , N-1)
3038                                                  32-bit integers for the size in bytes
3039                                                  of the i-th FormatString argument of
3040                                                  the printf function call
3041
3042                                                ``FormatString``
3043                                                  The format string passed to the
3044                                                  printf function call.
3045     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
3046                       map                      kernel in the code object. See
3047                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3048                                                for the definition of the keys included
3049                                                in that map.
3050     ================= ============== ========= =======================================
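
The "amdhsa.printf" encoding described above can be decoded with ordinary
string splitting. The following is a minimal sketch, assuming that each entry
is available as a Python string and that every field other than the format
string is a decimal integer; the function name ``parse_printf_entry`` is
hypothetical.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split one "amdhsa.printf" string into (ID, sizes, format string)."""
       parts = entry.split(":", 2)
       if len(parts) != 3:
           raise ValueError("expected at least ID:N:FormatString")
       printf_id, count = int(parts[0]), int(parts[1])
       if count < 0:
           raise ValueError("N must not be negative")
       # Split off exactly N size fields. The remainder is the format
       # string, which may itself contain ':' characters.
       fields = parts[2].split(":", count)
       if len(fields) != count + 1:
           raise ValueError("fewer size fields than N")
       sizes = [int(field) for field in fields[:count]]
       return printf_id, sizes, fields[count]


   print(parse_printf_entry("1:2:4:8:x=%d y=%ld: done"))
   # (1, [4, 8], 'x=%d y=%ld: done')

Limiting the number of splits to ``N`` keeps any colon inside the format
string intact, which a plain ``entry.split(":")`` would not do.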
3051
3052..
3053
3054  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3055     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3056
3057     =================================== ============== ========= ================================
3058     String Key                          Value Type     Required? Description
3059     =================================== ============== ========= ================================
3060     ".name"                             string         Required  Source name of the kernel.
3061     ".symbol"                           string         Required  Name of the kernel
3062                                                                  descriptor ELF symbol.
3063     ".language"                         string                   Source language of the kernel.
3064                                                                  Values include:
3065
3066                                                                  - "OpenCL C"
3067                                                                  - "OpenCL C++"
3068                                                                  - "HCC"
3069                                                                  - "HIP"
3070                                                                  - "OpenMP"
3071                                                                  - "Assembler"
3072
3073     ".language_version"                 sequence of              - The first integer is the major
3074                                         2 integers                 version.
3075                                                                  - The second integer is the
3076                                                                    minor version.
3077     ".args"                             sequence of              Sequence of maps of the
3078                                         map                      kernel arguments. See
3079                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3080                                                                  for the definition of the keys
3081                                                                  included in that map.
3082     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3083                                         3 integers               must be >=1 and the dispatch
3084                                                                  work-group size X, Y, Z must
3085                                                                  correspond to the specified
3086                                                                  values. Defaults to 0, 0, 0.
3087
3088                                                                  Corresponds to the OpenCL
3089                                                                  ``reqd_work_group_size``
3090                                                                  attribute.
3091     ".workgroup_size_hint"              sequence of              The dispatch work-group size
3092                                         3 integers               X, Y, Z is likely to be the
3093                                                                  specified values.
3094
3095                                                                  Corresponds to the OpenCL
3096                                                                  ``work_group_size_hint``
3097                                                                  attribute.
3098     ".vec_type_hint"                    string                   The name of a scalar or vector
3099                                                                  type.
3100
3101                                                                  Corresponds to the OpenCL
3102                                                                  ``vec_type_hint`` attribute.
3103
3104     ".device_enqueue_symbol"            string                   The external symbol name
3105                                                                  associated with a kernel.
3106                                                                  The OpenCL runtime allocates a
3107                                                                  global buffer for the symbol
3108                                                                  and saves the kernel's address
3109                                                                  to it, which is used for
3110                                                                  device side enqueueing. Only
3111                                                                  available for device side
3112                                                                  enqueued kernels.
3113     ".kernarg_segment_size"             integer        Required  The size in bytes of
3114                                                                  the kernarg segment
3115                                                                  that holds the values
3116                                                                  of the arguments to
3117                                                                  the kernel.
3118     ".group_segment_fixed_size"         integer        Required  The amount of group
3119                                                                  segment memory
3120                                                                  required by a
3121                                                                  work-group in
3122                                                                  bytes. This does not
3123                                                                  include any
3124                                                                  dynamically allocated
3125                                                                  group segment memory
3126                                                                  that may be added
3127                                                                  when the kernel is
3128                                                                  dispatched.
3129     ".private_segment_fixed_size"       integer        Required  The amount of fixed
3130                                                                  private address space
3131                                                                  memory required for a
3132                                                                  work-item in
3133                                                                  bytes. If the kernel
3134                                                                  uses a dynamic call
3135                                                                  stack then additional
3136                                                                  space must be added
3137                                                                  to this value for the
3138                                                                  call stack.
3139     ".kernarg_segment_align"            integer        Required  The maximum byte
3140                                                                  alignment of
3141                                                                  arguments in the
3142                                                                  kernarg segment. Must
3143                                                                  be a power of 2.
3144     ".wavefront_size"                   integer        Required  Wavefront size. Must
3145                                                                  be a power of 2.
3146     ".sgpr_count"                       integer        Required  Number of scalar
3147                                                                  registers required by a
3148                                                                  wavefront for
3149                                                                  GFX6-GFX9. A register
3150                                                                  is required if it is
3151                                                                  used explicitly, or
3152                                                                  if a higher numbered
3153                                                                  register is used
3154                                                                  explicitly. This
3155                                                                  includes the special
3156                                                                  SGPRs for VCC, Flat
3157                                                                  Scratch (GFX7-GFX9)
3158                                                                  and XNACK (for
3159                                                                  GFX8-GFX9). It does
3160                                                                  not include the 16
3161                                                                  SGPRs added if a trap
3162                                                                  handler is
3163                                                                  enabled. It is not
3164                                                                  rounded up to the
3165                                                                  allocation
3166                                                                  granularity.
3167     ".vgpr_count"                       integer        Required  Number of vector
3168                                                                  registers required by
3169                                                                  each work-item for
3170                                                                  GFX6-GFX9. A register
3171                                                                  is required if it is
3172                                                                  used explicitly, or
3173                                                                  if a higher numbered
3174                                                                  register is used
3175                                                                  explicitly.
3176     ".max_flat_workgroup_size"          integer        Required  Maximum flat
3177                                                                  work-group size
3178                                                                  supported by the
3179                                                                  kernel in work-items.
3180                                                                  Must be >=1 and
3181                                                                  consistent with
3182                                                                  ".reqd_workgroup_size" if
3183                                                                  not 0, 0, 0.
3184     ".sgpr_spill_count"                 integer                  Number of stores from
3185                                                                  a scalar register to
3186                                                                  a register allocator
3187                                                                  created spill
3188                                                                  location.
3189     ".vgpr_spill_count"                 integer                  Number of stores from
3190                                                                  a vector register to
3191                                                                  a register allocator
3192                                                                  created spill
3193                                                                  location.
3194     ".kind"                             string                   The kind of the kernel
3195                                                                  with the following
3196                                                                  values:
3197
3198                                                                  "normal"
3199                                                                    Regular kernels.
3200
3201                                                                  "init"
3202                                                                    These kernels must be
3203                                                                    invoked after loading
3204                                                                    the containing code
3205                                                                    object and must
3206                                                                    complete before any
3207                                                                    normal and fini
3208                                                                    kernels in the same
3209                                                                    code object are
3210                                                                    invoked.
3211
3212                                                                  "fini"
3213                                                                    These kernels must be
3214                                                                    invoked before
3215                                                                    unloading the
3216                                                                    containing code object
3217                                                                    and after all init and
3218                                                                    normal kernels in the
3219                                                                    same code object have
3220                                                                    been invoked and
3221                                                                    completed.
3222
3223                                                                  If omitted, "normal" is
3224                                                                  assumed.
3225     =================================== ============== ========= ================================
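
Two of the defaults and constraints in the table above can be sketched in
code. The sketch assumes that a kernel map has been decoded into a Python
``dict``, and it interprets "consistent with" as meaning that the product of
the required work-group dimensions does not exceed
".max_flat_workgroup_size". That interpretation is an assumption and is not
stated by the table. The helper names are hypothetical.

.. code-block:: python

   def kernel_kind(kernel):
       """Return the kernel kind, applying the documented default."""
       return kernel.get(".kind", "normal")


   def check_workgroup_size(kernel):
       """Check ".reqd_workgroup_size" against ".max_flat_workgroup_size".

       Assumption: consistency means X * Y * Z <= the maximum flat size.
       """
       x, y, z = kernel.get(".reqd_workgroup_size", (0, 0, 0))
       if (x, y, z) == (0, 0, 0):
           # 0, 0, 0 is the default and places no requirement on dispatch.
           return True
       if min(x, y, z) < 1:
           # If not 0, 0, 0 then all values must be >= 1.
           return False
       return x * y * z <= kernel[".max_flat_workgroup_size"]


   kernel = {".max_flat_workgroup_size": 256,
             ".reqd_workgroup_size": [16, 16, 1]}
   print(kernel_kind(kernel), check_workgroup_size(kernel))  # normal True

A consumer that needs stricter validation would also check the keys marked as
required, which this sketch leaves out.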
3226
3227..
3228
3229  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3230     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3231
3232     ====================== ============== ========= ================================
3233     String Key             Value Type     Required? Description
3234     ====================== ============== ========= ================================
3235     ".name"                string                   Kernel argument name.
3236     ".type_name"           string                   Kernel argument type name.
3237     ".size"                integer        Required  Kernel argument size in bytes.
3238     ".offset"              integer        Required  Kernel argument offset in
3239                                                     bytes. The offset must be a
3240                                                     multiple of the alignment
3241                                                     required by the argument.
3242     ".value_kind"          string         Required  Kernel argument kind that
3243                                                     specifies how to set up the
3244                                                     corresponding argument.
3245                                                     Values include:
3246
3247                                                     "by_value"
3248                                                       The argument is copied
3249                                                       directly into the kernarg.
3250
3251                                                     "global_buffer"
3252                                                       A global address space pointer
3253                                                       to the buffer data is passed
3254                                                       in the kernarg.
3255
3256                                                     "dynamic_shared_pointer"
3257                                                       A group address space pointer
3258                                                       to dynamically allocated LDS
3259                                                       is passed in the kernarg.
3260
3261                                                     "sampler"
3262                                                       A global address space
3263                                                       pointer to an S# is passed in
3264                                                       the kernarg.
3265
3266                                                     "image"
3267                                                       A global address space
3268                                                       pointer to a T# is passed in
3269                                                       the kernarg.
3270
3271                                                     "pipe"
3272                                                       A global address space pointer
3273                                                       to an OpenCL pipe is passed in
3274                                                       the kernarg.
3275
3276                                                     "queue"
3277                                                       A global address space pointer
3278                                                       to an OpenCL device enqueue
3279                                                       queue is passed in the
3280                                                       kernarg.
3281
3282                                                     "hidden_global_offset_x"
3283                                                       The OpenCL grid dispatch
3284                                                       global offset for the X
3285                                                       dimension is passed in the
3286                                                       kernarg.
3287
3288                                                     "hidden_global_offset_y"
3289                                                       The OpenCL grid dispatch
3290                                                       global offset for the Y
3291                                                       dimension is passed in the
3292                                                       kernarg.
3293
3294                                                     "hidden_global_offset_z"
3295                                                       The OpenCL grid dispatch
3296                                                       global offset for the Z
3297                                                       dimension is passed in the
3298                                                       kernarg.
3299
3300                                                     "hidden_none"
3301                                                       An argument that is not used
3302                                                       by the kernel. Space needs to
3303                                                       be left for it, but it does
3304                                                       not need to be set up.
3305
3306                                                     "hidden_printf_buffer"
3307                                                       A global address space pointer
3308                                                       to the runtime printf buffer
3309                                                       is passed in the kernarg.
3310
3311                                                     "hidden_hostcall_buffer"
3312                                                       A global address space pointer
3313                                                       to the runtime hostcall buffer
3314                                                       is passed in the kernarg.
3315
3316                                                     "hidden_default_queue"
3317                                                       A global address space pointer
3318                                                       to the OpenCL device enqueue
3319                                                       queue that should be used by
3320                                                       the kernel by default is
3321                                                       passed in the kernarg.
3322
3323                                                     "hidden_completion_action"
3324                                                       A global address space pointer
3325                                                       to help link enqueued kernels into
3326                                                       the ancestor tree for determining
3327                                                       when the parent kernel has finished.
3328
3329                                                     "hidden_multigrid_sync_arg"
3330                                                       A global address space pointer for
3331                                                       multi-grid synchronization is
3332                                                       passed in the kernarg.
3333
3334     ".value_type"          string                    Unused and deprecated. This should no longer
3335                                                      be emitted, but is accepted for compatibility.
3336
3337     ".pointee_align"       integer                  Alignment in bytes of pointee
3338                                                     type for pointer type kernel
3339                                                     argument. Must be a power
3340                                                     of 2. Only present if
3341                                                     ".value_kind" is
3342                                                     "dynamic_shared_pointer".
3343     ".address_space"       string                   Kernel argument address space
3344                                                     qualifier. Only present if
3345                                                     ".value_kind" is "global_buffer" or
3346                                                     "dynamic_shared_pointer". Values
3347                                                     are:
3348
3349                                                     - "private"
3350                                                     - "global"
3351                                                     - "constant"
3352                                                     - "local"
3353                                                     - "generic"
3354                                                     - "region"
3355
3356                                                     .. TODO::
3357
3358                                                        Is "global_buffer" only "global"
3359                                                        or "constant"? Is
3360                                                        "dynamic_shared_pointer" always
3361                                                        "local"? Can HCC allow "generic"?
3362                                                        How can "private" or "region"
3363                                                        ever happen?
3364
3365     ".access"              string                   Kernel argument access
3366                                                     qualifier. Only present if
3367                                                     ".value_kind" is "image" or
3368                                                     "pipe". Values
3369                                                     are:
3370
3371                                                     - "read_only"
3372                                                     - "write_only"
3373                                                     - "read_write"
3374
3375                                                     .. TODO::
3376
3377                                                        Does this apply to
3378                                                        "global_buffer"?
3379
3380     ".actual_access"       string                   The actual memory accesses
3381                                                     performed by the kernel on the
3382                                                     kernel argument. Only present if
3383                                                     ".value_kind" is "global_buffer",
3384                                                     "image", or "pipe". This may be
3385                                                     more restrictive than indicated
3386                                                     by ".access" to reflect what the
3387                                                     kernel actually does. If not
3388                                                     present then the runtime must
3389                                                     assume what is implied by
3390                                                     ".access" and ".is_const". Values
3391                                                     are:
3392
3393                                                     - "read_only"
3394                                                     - "write_only"
3395                                                     - "read_write"
3396
3397     ".is_const"            boolean                  Indicates if the kernel argument
3398                                                     is const qualified. Only present
3399                                                     if ".value_kind" is
3400                                                     "global_buffer".
3401
3402     ".is_restrict"         boolean                  Indicates if the kernel argument
3403                                                     is restrict qualified. Only
3404                                                     present if ".value_kind" is
3405                                                     "global_buffer".
3406
3407     ".is_volatile"         boolean                  Indicates if the kernel argument
3408                                                     is volatile qualified. Only
3409                                                     present if ".value_kind" is
3410                                                     "global_buffer".
3411
3412     ".is_pipe"             boolean                  Indicates if the kernel argument
3413                                                     is pipe qualified. Only present
3414                                                     if ".value_kind" is "pipe".
3415
3416                                                     .. TODO::
3417
3418                                                        Can "global_buffer" be pipe
3419                                                        qualified?
3420
3421     ====================== ============== ========= ================================
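
The "Only present if" rules for the kernel argument keys above can be checked
mechanically. The following is a minimal sketch, assuming a kernel argument is
available as a Python dictionary keyed by the metadata strings. The function
name ``check_kernel_arg`` and the dictionary representation are assumptions of
this sketch and are not part of any LLVM or runtime API.

```python
# Illustrative check of the "Only present if ..." rules from the table above.
# The names used here are hypothetical and exist only for this sketch.

# Key -> set of ".value_kind" values for which the key may be present.
ALLOWED_VALUE_KINDS = {
    ".pointee_align": {"dynamic_shared_pointer"},
    ".address_space": {"global_buffer", "dynamic_shared_pointer"},
    ".access": {"image", "pipe"},
    ".actual_access": {"global_buffer", "image", "pipe"},
    ".is_const": {"global_buffer"},
    ".is_restrict": {"global_buffer"},
    ".is_volatile": {"global_buffer"},
    ".is_pipe": {"pipe"},
}

ADDRESS_SPACES = {"private", "global", "constant", "local", "generic", "region"}
ACCESS_VALUES = {"read_only", "write_only", "read_write"}


def check_kernel_arg(arg):
    """Return a list of rule violations for one kernel argument map."""
    problems = []
    kind = arg.get(".value_kind")
    # A key may only appear together with the value kinds listed for it.
    for key, kinds in ALLOWED_VALUE_KINDS.items():
        if key in arg and kind not in kinds:
            problems.append(f"{key} is not expected for value kind {kind!r}")
    # ".pointee_align" must be a power of 2.
    align = arg.get(".pointee_align")
    if align is not None and (align <= 0 or align & (align - 1)):
        problems.append(".pointee_align must be a power of 2")
    # Enumerated string values must come from the documented lists.
    if ".address_space" in arg and arg[".address_space"] not in ADDRESS_SPACES:
        problems.append("unknown .address_space value")
    for key in (".access", ".actual_access"):
        if key in arg and arg[key] not in ACCESS_VALUES:
            problems.append(f"unknown {key} value")
    return problems
```

For example, an argument map with ``".value_kind"`` set to ``"pipe"`` that also
carries ``".is_const"`` would be reported, because ``".is_const"`` is only
described for ``"global_buffer"``.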
3422
3423.. _amdgpu-amdhsa-code-object-metadata-v4:
3424
3425Code Object V4 Metadata
3426+++++++++++++++++++++++
3427
3428.. warning::
3429  Code object V4 is not the default code object version emitted by this version
3430  of LLVM.
3431
3432Code object V4 metadata is the same as
3433:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3434defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3435
3436  .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
3437     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3438
3439     ================= ============== ========= =======================================
3440     String Key        Value Type     Required? Description
3441     ================= ============== ========= =======================================
3442     "amdhsa.version"  sequence of    Required  - The first integer is the major
3443                       2 integers                 version. Currently 1.
3444                                                - The second integer is the minor
3445                                                  version. Currently 1.
3446     "amdhsa.target"   string         Required  The target name of the code using the syntax:
3447
3448                                                .. code::
3449
3450                                                  <target-triple> [ "-" <target-id> ]
3451
3452                                                A canonical target ID must be
3453                                                used. See :ref:`amdgpu-target-triples`
3454                                                and :ref:`amdgpu-target-id`.
3455     ================= ============== ========= =======================================
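
As a small illustration of the two entries above, the following sketch builds
the top-level map with the listed version numbers and composes the
``"amdhsa.target"`` string from a target triple and an optional target ID. The
helper names are hypothetical, and the triple and target ID strings are
example values assumed for illustration only.

```python
def make_amdhsa_target(target_triple, target_id=None):
    """Join a target triple and an optional canonical target ID with "-"."""
    if target_id is None:
        return target_triple
    return f"{target_triple}-{target_id}"


def make_v4_metadata(target_triple, target_id=None):
    """Top-level keys that the V4 table changes or adds relative to V3."""
    return {
        "amdhsa.version": [1, 1],  # major and minor version as listed above
        "amdhsa.target": make_amdhsa_target(target_triple, target_id),
    }


# Assumed example values: a triple with an empty environment component and a
# bare processor name used as the target ID.
metadata = make_v4_metadata("amdgcn-amd-amdhsa-", "gfx908")
```

The sketch does not check that the target ID is in canonical form. That rule
is defined in :ref:`amdgpu-target-id`.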
3456
3457..
3458
3459Kernel Dispatch
3460~~~~~~~~~~~~~~~
3461
3462The HSA architected queuing language (AQL) defines a user space memory interface
3463that can be used to control the dispatch of kernels, in an agent independent
3464way. An agent can have zero or more AQL queues created for it using an HSA
3465compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3466are 64 bytes) can be placed. See the *HSA Platform System Architecture
3467Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3468
3469The packet processor of a kernel agent is responsible for detecting and
3470dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3471packet processor is implemented by the hardware command processor (CP),
3472asynchronous dispatch controller (ADC) and shader processor input controller
3473(SPI).
3474
3475An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3476the kernel mode driver to initialize and register the AQL queue with CP.
3477
3478To dispatch a kernel the following actions are performed. This can occur in the
3479CPU host program, or from an HSA kernel executing on a GPU.
3480
34811. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3482   executed is obtained.
34832. A pointer to the kernel descriptor (see
3484   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3485   It must be for a kernel that is contained in a code object that was
3486   loaded by an HSA compatible runtime on the kernel agent with which the AQL
3487   queue is associated.
34883. Space is allocated for the kernel arguments using the HSA compatible runtime
3489   allocator for a memory region with the kernarg property for the kernel agent
3490   that will execute the kernel. It must be at least 16-byte aligned.
34914. Kernel argument values are assigned to the kernel argument memory
3492   allocation. The layout is defined in the *HSA Programmer's Language
3493   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3494   kernel argument memory in the same way constant memory is accessed. (Note
3495   that the HSA specification allows an implementation to copy the kernel
3496   argument contents to another location that is accessed by the kernel.)
34975. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3498   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3499   for the packet. The packet must be set up, and the final write must use an
3500   atomic store release to set the packet kind to ensure the packet contents are
3501   visible to the kernel agent. AQL defines a doorbell signal mechanism to
3502   notify the kernel agent that the AQL queue has been updated. These rules, and
3503   the layout of the AQL queue and kernel dispatch packet, are defined in the
3504   *HSA System Architecture Specification* [HSA]_.
35056. A kernel dispatch packet includes information about the actual dispatch,
3506   such as grid and work-group size, together with information from the code
3507   object about the kernel, such as segment sizes. The HSA compatible runtime
3508   queries on the kernel symbol can be used to obtain the code object values
3509   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
35107. CP executes micro-code and is responsible for detecting and setting up the
3511   GPU to execute the wavefronts of a kernel dispatch.
35128. CP ensures that when a wavefront starts executing the kernel machine
3513   code, the scalar general purpose registers (SGPR) and vector general purpose
3514   registers (VGPR) are set up as required by the machine code. The required
3515   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3516   register state is defined in
3517   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
35189. The prolog of the kernel machine code (see
3519   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3520   before continuing executing the machine code that corresponds to the kernel.
352110. When the kernel dispatch has completed execution, CP signals the completion
3522    signal specified in the kernel dispatch packet if not 0.
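
The ordering in step 5 (reserve a slot, fill in the packet, write the packet
kind last, then signal the doorbell) can be sketched with a toy model. This is
a single-threaded illustration only and is not the HSA runtime API. The class
name, the 2-byte kind field at offset 0, and the numeric packet kind values are
assumptions of the sketch. The actual layout is defined in the HSA
specifications referenced above.

```python
import struct

PACKET_SIZE = 64                  # all AQL packets are 64 bytes
INVALID, KERNEL_DISPATCH = 1, 2   # placeholder packet kinds for this sketch


class ToyAqlQueue:
    """Single-threaded stand-in for an AQL queue ring buffer."""

    def __init__(self, num_packets):
        self.num_packets = num_packets
        self.ring = bytearray(PACKET_SIZE * num_packets)
        # Mark every slot as not ready for the packet processor.
        for slot in range(num_packets):
            struct.pack_into("<H", self.ring, slot * PACKET_SIZE, INVALID)
        self.write_index = 0      # a runtime reserves with a 64-bit atomic
        self.doorbell = None      # index of the last packet signalled

    def reserve(self):
        """Reserve one packet slot and return its packet index."""
        index = self.write_index
        self.write_index += 1
        return index

    def publish(self, index, body):
        """Fill in the packet body, then write the packet kind last."""
        assert len(body) == PACKET_SIZE - 2
        base = (index % self.num_packets) * PACKET_SIZE
        self.ring[base + 2:base + PACKET_SIZE] = body
        # A real producer uses an atomic store release here, so the body is
        # visible to the kernel agent before it observes a valid packet kind.
        struct.pack_into("<H", self.ring, base, KERNEL_DISPATCH)
        self.doorbell = index     # notify the packet processor
```

A producer in this model calls ``reserve()`` to obtain an index and then
``publish()`` with the remaining 62 bytes of the packet. The model does not
check for a full queue, which a real producer must do before writing.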
3523
3524.. _amdgpu-amdhsa-memory-spaces:
3525
3526Memory Spaces
3527~~~~~~~~~~~~~
3528
3529The memory space properties are:
3530
3531  .. table:: AMDHSA Memory Spaces
3532     :name: amdgpu-amdhsa-memory-spaces-table
3533
3534     ================= =========== ======== ======= ==================
3535     Memory Space Name HSA Segment Hardware Address NULL Value
3536                       Name        Name     Size
3537     ================= =========== ======== ======= ==================
3538     Private           private     scratch  32      0x00000000
3539     Local             group       LDS      32      0xFFFFFFFF
3540     Global            global      global   64      0x0000000000000000
3541     Constant          constant    *same as 64      0x0000000000000000
3542                                   global*
3543     Generic           flat        flat     64      0x0000000000000000
3544     Region            N/A         GDS      32      *not implemented
3545                                                    for AMDHSA*
3546     ================= =========== ======== ======= ==================
3547
3548The global and constant memory spaces both use global virtual addresses, which
3549are the same virtual address space used by the CPU. However, some virtual
3550addresses may only be accessible to the CPU, some only accessible by the GPU,
3551and some by both.
3552
3553Using the constant memory space indicates that the data will not change during
3554the execution of the kernel. This allows scalar read instructions to be
3555used. The vector and scalar L1 caches are invalidated of volatile data before
3556each kernel dispatch execution to allow constant memory to change values between
3557kernel dispatches.
3558
3559The local memory space uses the hardware Local Data Store (LDS) which is
3560automatically allocated when the hardware creates work-groups of wavefronts, and
3561freed when all the wavefronts of a work-group have terminated. The data store
3562(DS) instructions can be used to access it.
3563
3564The private memory space uses the hardware scratch memory support. If the kernel
3565uses scratch, then the hardware allocates memory that is accessed using
3566wavefront lane dword (4 byte) interleaving. The mapping used from private
3567address to physical address is:
3568
3569  ``wavefront-scratch-base +
3570  (private-address * wavefront-size * 4) +
3571  (wavefront-lane-id * 4)``
3572
3573There are different ways that the wavefront scratch base address is determined
3574by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3575memory can be accessed in an interleaved manner using buffer instructions with
3576the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3577instructions, or by flat instructions. If each lane of a wavefront accesses the
3578same private address, the interleaving results in adjacent dwords being accessed
3579and hence requires fewer cache lines to be fetched. Multi-dword access is not
3580supported except by flat and scratch instructions in GFX9-GFX10.
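
The effect of the interleaving can be seen by evaluating the mapping for every
lane of a wavefront. The sketch below applies the formula literally as written
above. The function name, the scratch base value, and the default wavefront
size of 64 are assumptions chosen for illustration.

```python
def scratch_address(wavefront_scratch_base, private_address, lane_id,
                    wavefront_size=64):
    """Evaluate the private-to-physical mapping formula as written above."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + lane_id * 4)


# Every lane accesses the same private address (3 in this example).
addresses = [scratch_address(0x1000, 3, lane) for lane in range(64)]

# Consecutive lanes land on consecutive dwords.
strides = {b - a for a, b in zip(addresses, addresses[1:])}
```

Here ``strides`` contains only the value 4, so the 64 lanes cover one
contiguous 256-byte range, which is why fewer cache lines need to be fetched.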
3581
3582The generic address space uses the hardware flat address support available in
3583GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
3584local apertures), that are outside the range of addressable global memory, to
3585map from a flat address to a private or local address.
3586
3587FLAT instructions can take a flat address and access global, private (scratch)
3588and group (LDS) memory depending on whether the address is within one of the
3589aperture ranges. Flat access to scratch requires hardware aperture setup and
3590setup in the kernel prologue (see
3591:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3592hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3593:ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3594
3595To convert between a segment address and a flat address the aperture base
3596addresses can be used. For GFX7-GFX8 these are available in the
3597:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
3598Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3599GFX9-GFX10 the aperture base addresses are directly available as inline constant
3600registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3601address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3602which makes it easier to convert from flat to segment or segment to flat.
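
Because each aperture is 2^32 bytes in size and its base is aligned to 2^32,
the conversion reduces to operations on the upper and lower 32 bits of the
address. The following is a minimal sketch of that arithmetic. The function
names and the aperture base value in the example are hypothetical, and the
translation of NULL values between address spaces is not handled.

```python
APERTURE_SIZE = 1 << 32   # aperture size in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside the given aperture."""
    assert aperture_base % APERTURE_SIZE == 0      # base is aligned to 2^32
    assert 0 <= segment_address < APERTURE_SIZE    # segment address is 32 bits
    return aperture_base | segment_address


def in_aperture(aperture_base, flat_address):
    """A flat address is in the aperture if the upper 32 bits match the base."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32-bit segment address from a flat address."""
    assert in_aperture(aperture_base, flat_address)
    return flat_address & (APERTURE_SIZE - 1)


# Hypothetical aperture base, chosen only to make the example concrete.
example_base = 0x0000_2000_0000_0000
flat = segment_to_flat(example_base, 0x40)
```

With these values ``flat`` is ``0x0000_2000_0000_0040``, and
``flat_to_segment(example_base, flat)`` returns ``0x40`` again.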
3603
3604Image and Samplers
3605~~~~~~~~~~~~~~~~~~
3606
3607Image and sampler handles created by an HSA compatible runtime (see
3608:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3609object respectively. In order to support the HSA ``query_sampler`` operations
3610two extra dwords are used to store the HSA BRIG enumeration values for the
3611queries that are not trivially deducible from the S# representation.
3612
3613HSA Signals
3614~~~~~~~~~~~
3615
3616HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3617are 64-bit addresses of a structure allocated in memory accessible from both the
3618CPU and GPU. The structure is defined by the runtime and subject to change
3619between releases. For example, see [AMD-ROCm-github]_.
3620
3621.. _amdgpu-amdhsa-hsa-aql-queue:
3622
3623HSA AQL Queue
3624~~~~~~~~~~~~~
3625
3626The HSA AQL queue structure is defined by an HSA compatible runtime (see
3627:ref:`amdgpu-os`) and subject to change between releases. For example, see
3628[AMD-ROCm-github]_. For some processors it contains fields needed to implement
3629certain language features such as the flat address aperture bases. It also
3630contains fields used by CP such as managing the allocation of scratch memory.
3631
3632.. _amdgpu-amdhsa-kernel-descriptor:
3633
3634Kernel Descriptor
3635~~~~~~~~~~~~~~~~~
3636
3637A kernel descriptor consists of the information needed by CP to initiate the
3638execution of a kernel, including the entry point address of the machine code
3639that implements the kernel.
3640
3641Code Object V3 Kernel Descriptor
3642++++++++++++++++++++++++++++++++
3643
3644CP microcode requires the kernel descriptor to be allocated on 64-byte
3645alignment.
3646
3647The fields used by CP for code objects before V3 also match those specified in
3648:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3649
3650  .. table:: Code Object V3 Kernel Descriptor
3651     :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3652
3653     ======= ======= =============================== ============================
3654     Bits    Size    Field Name                      Description
3655     ======= ======= =============================== ============================
3656     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3657                                                     address space memory
3658                                                     required for a work-group
3659                                                     in bytes. This does not
3660                                                     include any dynamically
3661                                                     allocated local address
3662                                                     space memory that may be
3663                                                     added when the kernel is
3664                                                     dispatched.
3665     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3666                                                     private address space
3667                                                     memory required for a
3668                                                     work-item in bytes.
3669                                                     Additional space may need to
3670                                                     be added to this value if
3671                                                     the call stack has
3672                                                     non-inlined function calls.
3673     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3674                                                     memory pointed to by the
3675                                                     AQL dispatch packet. The
3676                                                     kernarg memory is used to
3677                                                     pass arguments to the
3678                                                     kernel.
3679
3680                                                     * If the kernarg pointer in
3681                                                       the dispatch packet is NULL
3682                                                       then there are no kernel
3683                                                       arguments.
3684                                                     * If the kernarg pointer in
3685                                                       the dispatch packet is
3686                                                       not NULL and this value
3687                                                       is 0 then the kernarg
3688                                                       memory size is
3689                                                       unspecified.
3690                                                     * If the kernarg pointer in
3691                                                       the dispatch packet is
3692                                                       not NULL and this value
3693                                                       is not 0 then the value
3694                                                       specifies the kernarg
3695                                                       memory size in bytes. It
3696                                                       is recommended to provide
3697                                                       a value as it may be used
3698                                                       by CP to optimize making
3699                                                       the kernarg memory
3700                                                       visible to the kernel
3701                                                       code.
3702
3703     127:96  4 bytes                                 Reserved, must be 0.
3704     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3705                                                     negative) from base
3706                                                     address of kernel
3707                                                     descriptor to kernel's
3708                                                     entry point instruction
3709                                                     which must be 256 byte
3710                                                     aligned.
3711     351:192 20                                      Reserved, must be 0.
3712             bytes
3713     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
3714                                                       Reserved, must be 0.
3715                                                     GFX90A
3716                                                       Compute Shader (CS)
3717                                                       program settings used by
3718                                                       CP to set up
3719                                                       ``COMPUTE_PGM_RSRC3``
3720                                                       configuration
3721                                                       register. See
3722                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3723                                                     GFX10
3724                                                       Compute Shader (CS)
3725                                                       program settings used by
3726                                                       CP to set up
3727                                                       ``COMPUTE_PGM_RSRC3``
3728                                                       configuration
3729                                                       register. See
3730                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
3731     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3732                                                     program settings used by
3733                                                     CP to set up
3734                                                     ``COMPUTE_PGM_RSRC1``
3735                                                     configuration
3736                                                     register. See
3737                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3738     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
3739                                                     program settings used by
3740                                                     CP to set up
3741                                                     ``COMPUTE_PGM_RSRC2``
3742                                                     configuration
3743                                                     register. See
3744                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
3745     454:448 7 bits  *See separate bits below.*      Enable the setup of the
3746                                                     SGPR user data registers
3747                                                     (see
3748                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3749
3750                                                     The total number of SGPR
3751                                                     user data registers
3752                                                     requested must not exceed
3753                                                     16 and must match the value in
3754                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3755                                                     Any requests beyond 16
3756                                                     will be ignored.
3757     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
3758                     _BUFFER                         column of
3759                                                     :ref:`amdgpu-processor-table`
3760                                                     specifies *Architected flat
3761                                                     scratch* then not supported
3762                                                     and must be 0.
3763     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
3764     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
3765     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
3766     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
3767     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
3768                                                     column of
3769                                                     :ref:`amdgpu-processor-table`
3770                                                     specifies *Architected flat
3771                                                     scratch* then not supported
3772                                                     and must be 0.
3773     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
3774                     _SIZE
3775     457:455 3 bits                                  Reserved, must be 0.
3776     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
3777                                                       Reserved, must be 0.
3778                                                     GFX10
3779                                                       - If 0 execute in
3780                                                         wavefront size 64 mode.
3781                                                       - If 1 execute in
3782                                                         native wavefront size
3783                                                         32 mode.
3784     463:459 5 bits                                  Reserved, must be 0.
3785     464     1 bit   RESERVED_464                    Deprecated, must be 0.
3786     467:465 3 bits                                  Reserved, must be 0.
3787     468     1 bit   RESERVED_468                    Deprecated, must be 0.
3788     471:469 3 bits                                  Reserved, must be 0.
3789     511:472 5 bytes                                 Reserved, must be 0.
3790     512     **Total size 64 bytes.**
3791     ======= ====================================================================
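
The 64-byte layout in the table above can be expressed as a fixed binary
structure. The following is a minimal sketch that unpacks a descriptor with
Python's ``struct`` module, assuming little-endian byte order. Treating bits
463:448 as a single 16-bit word and the remaining 48 bits as reserved bytes is
a choice made for this sketch, and the function and dictionary key names are
hypothetical.

```python
import struct

# Four 32-bit fields, a signed 64-bit entry offset, 20 reserved bytes, three
# 32-bit resource words, a 16-bit properties word, and 6 reserved bytes.
KERNEL_DESCRIPTOR = struct.Struct("<IIIIq20sIIIH6s")
assert KERNEL_DESCRIPTOR.size == 64

# Bit position within the properties word (descriptor bit number minus 448).
PROPERTY_BITS = {
    "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,
    "ENABLE_SGPR_DISPATCH_PTR": 1,
    "ENABLE_SGPR_QUEUE_PTR": 2,
    "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
    "ENABLE_SGPR_DISPATCH_ID": 4,
    "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
    "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,
    "ENABLE_WAVEFRONT_SIZE32": 10,
}


def decode_kernel_descriptor(raw):
    """Unpack a 64-byte kernel descriptor into a dictionary of fields."""
    (group_size, private_size, kernarg_size, _reserved0, entry_offset,
     _reserved1, rsrc3, rsrc1, rsrc2, properties,
     _reserved2) = KERNEL_DESCRIPTOR.unpack(raw)
    return {
        "group_segment_fixed_size": group_size,
        "private_segment_fixed_size": private_size,
        "kernarg_size": kernarg_size,
        "kernel_code_entry_byte_offset": entry_offset,  # may be negative
        "compute_pgm_rsrc3": rsrc3,
        "compute_pgm_rsrc1": rsrc1,
        "compute_pgm_rsrc2": rsrc2,
        "enabled": [name for name, bit in PROPERTY_BITS.items()
                    if properties & (1 << bit)],
    }
```

Given the 64 bytes of a descriptor read from a code object,
``decode_kernel_descriptor(raw)["enabled"]`` lists the enable bits that are
set. The sketch does not verify that the reserved fields are 0.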
3792
3793..
3794
3795  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
3796     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
3797
3798     ======= ======= =============================== ===========================================================================
3799     Bits    Size    Field Name                      Description
3800     ======= ======= =============================== ===========================================================================
3801     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
3802                                                     blocks used by each work-item;
3803                                                     granularity is device
3804                                                     specific:
3805
3806                                                     GFX6-GFX9
3807                                                       - vgprs_used 0..256
3808                                                       - max(0, ceil(vgprs_used / 4) - 1)
3809                                                     GFX90A
3810                                                       - vgprs_used 0..512
3811                                                       - vgprs_used = align(arch_vgprs, 4)
3812                                                                      + acc_vgprs
3813                                                       - max(0, ceil(vgprs_used / 8) - 1)
3814                                                     GFX10 (wavefront size 64)
3815                                                       - vgprs_used 1..256
3816                                                       - max(0, ceil(vgprs_used / 4) - 1)
3817                                                     GFX10 (wavefront size 32)
3818                                                       - vgprs_used 1..256
3819                                                       - max(0, ceil(vgprs_used / 8) - 1)
3820
3821                                                     Where vgprs_used is defined
3822                                                     as the highest VGPR number
3823                                                     explicitly referenced plus
3824                                                     one.
3825
3826                                                     Used by CP to set up
3827                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
3828
3829                                                     The
3830                                                     :ref:`amdgpu-assembler`
3831                                                     calculates this
3832                                                     automatically for the
3833                                                     selected processor from
3834                                                     values provided to the
3835                                                     `.amdhsa_kernel` directive
3836                                                     by the
3837                                                     `.amdhsa_next_free_vgpr`
3838                                                     nested directive (see
3839                                                     :ref:`amdhsa-kernel-directives-table`).
3840     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
3841                                                     blocks used by a wavefront;
3842                                                     granularity is device
3843                                                     specific:
3844
3845                                                     GFX6-GFX8
3846                                                       - sgprs_used 0..112
3847                                                       - max(0, ceil(sgprs_used / 8) - 1)
3848                                                     GFX9
3849                                                       - sgprs_used 0..112
3850                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
3851                                                     GFX10
3852                                                       Reserved, must be 0.
3853                                                       (128 SGPRs always
3854                                                       allocated.)
3855
3856                                                     Where sgprs_used is
3857                                                     defined as the highest
3858                                                     SGPR number explicitly
3859                                                     referenced plus one, plus
3860                                                     a target specific number
3861                                                     of additional special
3862                                                     SGPRs for VCC,
3863                                                     FLAT_SCRATCH (GFX7+) and
3864                                                     XNACK_MASK (GFX8+), and
3865                                                     any additional
3866                                                     target specific
3867                                                     limitations. It does not
3868                                                     include the 16 SGPRs added
3869                                                     if a trap handler is
3870                                                     enabled.
3871
3872                                                     The target specific
3873                                                     limitations and special
3874                                                     SGPR layout are defined in
3875                                                     the hardware
3876                                                     documentation, which can
3877                                                     be found in the
3878                                                     :ref:`amdgpu-processors`
3879                                                     table.
3880
3881                                                     Used by CP to set up
3882                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
3883
3884                                                     The
3885                                                     :ref:`amdgpu-assembler`
3886                                                     calculates this
3887                                                     automatically for the
3888                                                     selected processor from
3889                                                     values provided to the
3890                                                     `.amdhsa_kernel` directive
3891                                                     by the
3892                                                     `.amdhsa_next_free_sgpr`
3893                                                     and `.amdhsa_reserve_*`
3894                                                     nested directives (see
3895                                                     :ref:`amdhsa-kernel-directives-table`).
3896     11:10   2 bits  PRIORITY                        Must be 0.
3897
3898                                                     Start executing wavefront
3899                                                     at the specified priority.
3900
3901                                                     CP is responsible for
3902                                                     filling in
3903                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
3904     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
3905                                                     with specified rounding
3906                                                     mode for single (32-bit)
3907                                                     precision floating point
3908                                                     operations.
3910
3911                                                     Floating point rounding
3912                                                     mode values are defined in
3913                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3914
3915                                                     Used by CP to set up
3916                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3917     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
3918                                                     with specified rounding
3919                                                     mode for half/double (16
3920                                                     and 64-bit) precision
3921                                                     floating point operations.
3923
3924                                                     Floating point rounding
3925                                                     mode values are defined in
3926                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3927
3928                                                     Used by CP to set up
3929                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3930     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
3931                                                     with specified denorm mode
3932                                                     for single (32-bit)
3933                                                     precision floating point
3934                                                     operations.
3936
3937                                                     Floating point denorm mode
3938                                                     values are defined in
3939                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3940
3941                                                     Used by CP to set up
3942                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3943     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
3944                                                     with specified denorm mode
3945                                                     for half/double (16
3946                                                     and 64-bit) precision
3947                                                     floating point operations.
3949
3950                                                     Floating point denorm mode
3951                                                     values are defined in
3952                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3953
3954                                                     Used by CP to set up
3955                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3956     20      1 bit   PRIV                            Must be 0.
3957
3958                                                     Start executing wavefront
3959                                                     in privilege trap handler
3960                                                     mode.
3961
3962                                                     CP is responsible for
3963                                                     filling in
3964                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
3965     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
3966                                                     with DX10 clamp mode
3967                                                     enabled. Used by the vector
3968                                                     ALU to force DX10 style
3969                                                     treatment of NaNs (when
3970                                                     set, clamp NaN to zero,
3971                                                     otherwise pass NaN
3972                                                     through).
3973
3974                                                     Used by CP to set up
3975                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
3976     22      1 bit   DEBUG_MODE                      Must be 0.
3977
3978                                                     Start executing wavefront
3979                                                     in single step mode.
3980
3981                                                     CP is responsible for
3982                                                     filling in
3983                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
3984     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
3985                                                     with IEEE mode
3986                                                     enabled. Floating point
3987                                                     opcodes that support
3988                                                     exception flag gathering
3989                                                     will quiet and propagate
3990                                                     signaling-NaN inputs per
3991                                                     IEEE 754-2008. Min_dx10 and
3992                                                     max_dx10 become IEEE
3993                                                     754-2008 compliant due to
3994                                                     signaling-NaN propagation
3995                                                     and quieting.
3996
3997                                                     Used by CP to set up
3998                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
3999     24      1 bit   BULKY                           Must be 0.
4000
4001                                                     Only one work-group allowed
4002                                                     to execute on a compute
4003                                                     unit.
4004
4005                                                     CP is responsible for
4006                                                     filling in
4007                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
4008     25      1 bit   CDBG_USER                       Must be 0.
4009
4010                                                     Flag that can be used to
4011                                                     control debugging code.
4012
4013                                                     CP is responsible for
4014                                                     filling in
4015                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4016     26      1 bit   FP16_OVFL                       GFX6-GFX8
4017                                                       Reserved, must be 0.
4018                                                     GFX9-GFX10
4019                                                       Wavefront starts execution
4020                                                       with specified fp16 overflow
4021                                                       mode.
4022
4023                                                       - If 0, fp16 overflow generates
4024                                                         +/-INF values.
4025                                                       - If 1, fp16 overflow that is the
4026                                                         result of an +/-INF input value
4027                                                         or divide by 0 produces a +/-INF,
4028                                                         otherwise clamps computed
4029                                                         overflow to +/-MAX_FP16 as
4030                                                         appropriate.
4031
4032                                                       Used by CP to set up
4033                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4034     28:27   2 bits                                  Reserved, must be 0.
4035     29      1 bit   WGP_MODE                        GFX6-GFX9
4036                                                       Reserved, must be 0.
4037                                                     GFX10
4038                                                       - If 0 execute work-groups in
4039                                                         CU wavefront execution mode.
4040                                                       - If 1 execute work-groups in
4041                                                         WGP wavefront execution mode.
4042
4043                                                       See :ref:`amdgpu-amdhsa-memory-model`.
4044
4045                                                       Used by CP to set up
4046                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4047     30      1 bit   MEM_ORDERED                     GFX6-GFX9
4048                                                       Reserved, must be 0.
4049                                                     GFX10
4050                                                       Controls the behavior of the
4051                                                       s_waitcnt's vmcnt and vscnt
4052                                                       counters.
4053
4054                                                       - If 0 vmcnt reports completion
4055                                                         of load and atomic with return
4056                                                         out of order with sample
4057                                                         instructions, and the vscnt
4058                                                         reports the completion of
4059                                                         store and atomic without
4060                                                         return in order.
4061                                                       - If 1 vmcnt reports completion
4062                                                         of load, atomic with return
4063                                                         and sample instructions in
4064                                                         order, and the vscnt reports
4065                                                         the completion of store and
4066                                                         atomic without return in order.
4067
4068                                                       Used by CP to set up
4069                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4070     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
4071                                                       Reserved, must be 0.
4072                                                     GFX10
4073                                                       - If 0 execute SIMD wavefronts
4074                                                         using oldest first policy.
4075                                                       - If 1 execute SIMD wavefronts to
4076                                                         ensure wavefronts will make some
4077                                                         forward progress.
4078
4079                                                       Used by CP to set up
4080                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4081     32      **Total size 4 bytes.**
4082     ======= ===================================================================================================================
4083
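The granulated register count formulas in the table above can be sketched in
Python as follows. This is only an illustration of the arithmetic, not the
assembler's implementation. The function names and the target label strings
(``"gfx9"``, ``"gfx90a"``, ``"gfx10"`` and so on) are hypothetical, and the
inputs are assumed to already be valid ``vgprs_used`` and ``sgprs_used`` values
as defined in the table.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def gfx90a_vgprs_used(arch_vgprs: int, acc_vgprs: int) -> int:
    """vgprs_used = align(arch_vgprs, 4) + acc_vgprs (GFX90A row)."""
    return ceil_div(arch_vgprs, 4) * 4 + acc_vgprs


def granulated_workitem_vgpr_count(vgprs_used: int, target: str,
                                   wave32: bool = False) -> int:
    """GRANULATED_WORKITEM_VGPR_COUNT for an illustrative target label."""
    if target in ("gfx6", "gfx7", "gfx8", "gfx9"):
        granule = 4
    elif target == "gfx90a":
        granule = 8
    elif target == "gfx10":
        granule = 8 if wave32 else 4
    else:
        raise ValueError(f"unknown target label: {target}")
    return max(0, ceil_div(vgprs_used, granule) - 1)


def granulated_wavefront_sgpr_count(sgprs_used: int, target: str) -> int:
    """GRANULATED_WAVEFRONT_SGPR_COUNT for an illustrative target label."""
    if target in ("gfx6", "gfx7", "gfx8"):
        return max(0, ceil_div(sgprs_used, 8) - 1)
    if target == "gfx9":
        return 2 * max(0, ceil_div(sgprs_used, 16) - 1)
    if target == "gfx10":
        return 0  # Reserved, must be 0: 128 SGPRs are always allocated.
    raise ValueError(f"unknown target label: {target}")


# Example inputs; the counts themselves are made up for illustration.
vgpr_gfx9 = granulated_workitem_vgpr_count(24, "gfx9")
vgpr_gfx10_w32 = granulated_workitem_vgpr_count(24, "gfx10", wave32=True)
vgpr_gfx90a = granulated_workitem_vgpr_count(gfx90a_vgprs_used(10, 6), "gfx90a")
sgpr_gfx9 = granulated_wavefront_sgpr_count(100, "gfx9")
```

In practice these values are not computed by hand: as the table notes, the
assembler derives them from the ``.amdhsa_next_free_vgpr``,
``.amdhsa_next_free_sgpr`` and ``.amdhsa_reserve_*`` directives.
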
4084..
4085
4086  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
4087     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
4088
4089     ======= ======= =============================== ===========================================================================
4090     Bits    Size    Field Name                      Description
4091     ======= ======= =============================== ===========================================================================
4092     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4093                                                       private segment.
4094                                                     * If the *Target Properties*
4095                                                       column of
4096                                                       :ref:`amdgpu-processor-table`
4097                                                       does not specify
4098                                                       *Architected flat
4099                                                       scratch* then enable the
4100                                                       setup of the SGPR
4101                                                       wavefront scratch offset
4102                                                       system register (see
4103                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4104                                                     * If the *Target Properties*
4105                                                       column of
4106                                                       :ref:`amdgpu-processor-table`
4107                                                       specifies *Architected
4108                                                       flat scratch* then enable
4109                                                       the setup of the
4110                                                       FLAT_SCRATCH register
4111                                                       pair (see
4112                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4113
4114                                                     Used by CP to set up
4115                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4116     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4117                                                     user data registers
4118                                                     requested. This number must
4119                                                     match the number of user
4120                                                     data registers enabled.
4121
4122                                                     Used by CP to set up
4123                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4124     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4125
4126                                                     This bit represents
4127                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4128                                                     which is set by the CP if
4129                                                     the runtime has installed a
4130                                                     trap handler.
4131     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4132                                                     system SGPR register for
4133                                                     the work-group id in the X
4134                                                     dimension (see
4135                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4136
4137                                                     Used by CP to set up
4138                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4139     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4140                                                     system SGPR register for
4141                                                     the work-group id in the Y
4142                                                     dimension (see
4143                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4144
4145                                                     Used by CP to set up
4146                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4147     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4148                                                     system SGPR register for
4149                                                     the work-group id in the Z
4150                                                     dimension (see
4151                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4152
4153                                                     Used by CP to set up
4154                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4155     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4156                                                     system SGPR register for
4157                                                     work-group information (see
4158                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4159
4160                                                     Used by CP to set up
4161                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4162     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4163                                                     VGPR system registers used
4164                                                     for the work-item ID.
4165                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4166                                                     defines the values.
4167
4168                                                     Used by CP to set up
4169                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4170     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4171
4172                                                     Wavefront starts execution
4173                                                     with address watch
4174                                                     exceptions enabled which
4175                                                     are generated when L1 has
4176                                                     witnessed a thread access
4177                                                     an *address of
4178                                                     interest*.
4179
4180                                                     CP is responsible for
4181                                                     filling in the address
4182                                                     watch bit in
4183                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4184                                                     according to what the
4185                                                     runtime requests.
4186     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4187
4188                                                     Wavefront starts execution
4189                                                     with memory violation
4190                                                     exceptions
4191                                                     enabled which are generated
4192                                                     when a memory violation has
4193                                                     occurred for this wavefront from
4194                                                     L1 or LDS
4195                                                     (write-to-read-only-memory,
4196                                                     mis-aligned atomic, LDS
4197                                                     address out of range,
4198                                                     illegal address, etc.).
4199
4200                                                     CP sets the memory
4201                                                     violation bit in
4202                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4203                                                     according to what the
4204                                                     runtime requests.
4205     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4206
4207                                                     CP uses the rounded value
4208                                                     from the dispatch packet,
4209                                                     not this value, as the
4210                                                     dispatch may contain
4211                                                     dynamically allocated group
4212                                                     segment memory. CP writes
4213                                                     directly to
4214                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4215
4216                                                     Amount of group segment
4217                                                     (LDS) to allocate for each
4218                                                     work-group. Granularity is
4219                                                     device specific:
4220
4221                                                     GFX6
4222                                                       roundup(lds-size / (64 * 4))
4223                                                     GFX7-GFX10
4224                                                       roundup(lds-size / (128 * 4))
4225
4226     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4227                     _INVALID_OPERATION              with specified exceptions
4228                                                     enabled.
4229
4230                                                     Used by CP to set up
4231                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
4232                                                     (set from bits 0..6).
4233
4234                                                     IEEE 754 FP Invalid
4235                                                     Operation
4236     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
4237                     _SOURCE                         input operands is a
4238                                                     denormal number
4239     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4240                     _DIVISION_BY_ZERO               Zero
4241     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4242                     _OVERFLOW
4243     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4244                     _UNDERFLOW
4245     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4246                     _INEXACT
4247     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4248                     _ZERO                           (rcp_iflag_f32 instruction
4249                                                     only)
4250     31      1 bit                                   Reserved, must be 0.
4251     32      **Total size 4 bytes.**
4252     ======= ===================================================================================================================
4253
4254..
4255
4256  .. table:: compute_pgm_rsrc3 for GFX90A
4257     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4258
4259     ======= ======= =============================== ===========================================================================
4260     Bits    Size    Field Name                      Description
4261     ======= ======= =============================== ===========================================================================
4262     5:0     6 bits  ACCUM_OFFSET                    Offset of a first AccVGPR in the unified register file. Granularity 4.
4263                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4264                                                     63 - accum-offset = 256.
     15:6    10                                      Reserved, must be 0.
             bits
4267     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
4268                                                       launched in the same CU.
4269                                                     - If 1 the waves of a work-group can be
4270                                                       launched in different CUs. The waves
4271                                                       cannot use S_BARRIER or LDS.
     31:17   15                                      Reserved, must be 0.
             bits
4274     32      **Total size 4 bytes.**
4275     ======= ===================================================================================================================
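
The ``ACCUM_OFFSET`` encoding above (field value 0 means an offset of 4, and
63 means 256) can be sketched as follows. This is a minimal illustration,
assuming the field is placed in bits 5:0 and ``TG_SPLIT`` in bit 16 as the
table states; the function names are hypothetical.

```python
def encode_compute_pgm_rsrc3_gfx90a(accum_offset, tg_split):
    """Pack ACCUM_OFFSET (bits 5:0) and TG_SPLIT (bit 16). Hypothetical helper."""
    if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
        raise ValueError("accum_offset must be a multiple of 4 in 4..256")
    field = accum_offset // 4 - 1  # granularity 4: 0 -> 4, 1 -> 8, ..., 63 -> 256
    return field | (int(bool(tg_split)) << 16)


def decode_accum_offset(rsrc3):
    """Recover the AccVGPR offset from the ACCUM_OFFSET field."""
    return ((rsrc3 & 0x3F) + 1) * 4
```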
4276
4277..
4278
4279  .. table:: compute_pgm_rsrc3 for GFX10
4280     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
4281
4282     ======= ======= =============================== ===========================================================================
4283     Bits    Size    Field Name                      Description
4284     ======= ======= =============================== ===========================================================================
4285     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
4286                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
4287     31:4    28                                      Reserved, must be 0.
4288             bits
4289     32      **Total size 4 bytes.**
4290     ======= ===================================================================================================================
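
A small sketch of the ``SHARED_VGPR_COUNT`` encoding, assuming the 4-bit field
simply holds the shared VGPR count divided by the granularity of 8. The helper
name is hypothetical, and the constraint involving ``compute_pgm_rsrc1.vgprs``
is not modeled here.

```python
def encode_compute_pgm_rsrc3_gfx10(shared_vgprs):
    """Pack SHARED_VGPR_COUNT into bits 3:0. Hypothetical helper."""
    if shared_vgprs % 8 != 0 or not 0 <= shared_vgprs <= 120:
        raise ValueError("shared_vgprs must be a multiple of 8 in 0..120")
    return shared_vgprs // 8  # bits 31:4 stay 0 (reserved)
```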
4291
4292..
4293
4294  .. table:: Floating Point Rounding Mode Enumeration Values
4295     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4296
4297     ====================================== ===== ==============================
4298     Enumeration Name                       Value Description
4299     ====================================== ===== ==============================
4300     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4301     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4302     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4303     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4304     ====================================== ===== ==============================
4305
4306..
4307
4308  .. table:: Floating Point Denorm Mode Enumeration Values
4309     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4310
4311     ====================================== ===== ==============================
4312     Enumeration Name                       Value Description
4313     ====================================== ===== ==============================
4314     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
4315                                                  Denorms
4316     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4317     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4318     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4319     ====================================== ===== ==============================
4320
4321..
4322
4323  .. table:: System VGPR Work-Item ID Enumeration Values
4324     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4325
4326     ======================================== ===== ============================
4327     Enumeration Name                         Value Description
4328     ======================================== ===== ============================
4329     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4330                                                    ID.
4331     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4332                                                    dimensions ID.
4333     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4334                                                    dimensions ID.
4335     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4336     ======================================== ===== ============================
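
The three enumeration tables above can be mirrored in tooling as plain integer
enumerations. The following Python sketch is only an illustration; the class
names and the ``workitem_dimensions`` helper are hypothetical, while the member
names and values are taken from the tables.

```python
from enum import IntEnum


class FloatRoundMode(IntEnum):
    NEAR_EVEN = 0        # Round Ties To Even
    PLUS_INFINITY = 1    # Round Toward +infinity
    MINUS_INFINITY = 2   # Round Toward -infinity
    ZERO = 3             # Round Toward 0


class FloatDenormMode(IntEnum):
    FLUSH_SRC_DST = 0    # Flush Source and Destination Denorms
    FLUSH_DST = 1        # Flush Output Denorms
    FLUSH_SRC = 2        # Flush Source Denorms
    FLUSH_NONE = 3       # No Flush


class SystemVgprWorkitemId(IntEnum):
    X = 0
    X_Y = 1
    X_Y_Z = 2
    UNDEFINED = 3


def workitem_dimensions(mode):
    """Return the work-item ID dimensions set for a given enumeration value."""
    if mode == SystemVgprWorkitemId.UNDEFINED:
        raise ValueError("SYSTEM_VGPR_WORKITEM_ID_UNDEFINED has no defined meaning")
    return ("X", "Y", "Z")[: int(mode) + 1]
```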
4337
4338.. _amdgpu-amdhsa-initial-kernel-execution-state:
4339
4340Initial Kernel Execution State
4341~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4342
4343This section defines the register state that will be set up by the packet
4344processor prior to the start of execution of every wavefront. This is limited by
4345the constraints of the hardware controllers of CP/ADC/SPI.
4346
4347The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4349fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4350for enabled registers are dense starting at SGPR0: the first enabled register is
4351SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4352an SGPR number.
4353
The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4355all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4356using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4357actually initialized. These are then immediately followed by the System SGPRs
4358that are set up by ADC/SPI and can have different values for each wavefront of
4359the grid dispatch.
4360
4361SGPR register initial state is defined in
4362:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4363
4364  .. table:: SGPR Register Set Up Order
4365     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4366
4367     ========== ========================== ====== ==============================
4368     SGPR Order Name                       Number Description
4369                (kernel descriptor enable  of
4370                field)                     SGPRs
4371     ========== ========================== ====== ==============================
4372     First      Private Segment Buffer     4      See
4373                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4374                _segment_buffer)
4375     then       Dispatch Ptr               2      64-bit address of AQL dispatch
4376                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4377                                                  actually executing.
4378     then       Queue Ptr                  2      64-bit address of amd_queue_t
4379                (enable_sgpr_queue_ptr)           object for AQL queue on which
4380                                                  the dispatch packet was
4381                                                  queued.
4382     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4383                (enable_sgpr_kernarg              segment. This is directly
4384                _segment_ptr)                     copied from the
4385                                                  kernarg_address in the kernel
4386                                                  dispatch packet.
4387
4388                                                  Having CP load it once avoids
4389                                                  loading it at the beginning of
4390                                                  every wavefront.
4391     then       Dispatch Id                2      64-bit Dispatch ID of the
4392                (enable_sgpr_dispatch_id)         dispatch packet being
4393                                                  executed.
4394     then       Flat Scratch Init          2      See
4395                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4396                _init)
4397     then       Private Segment Size       1      The 32-bit byte size of a
4398                (enable_sgpr_private              single work-item's memory
4399                _segment_size)                    allocation. This is the
4400                                                  value from the kernel
4401                                                  dispatch packet Private
4402                                                  Segment Byte Size rounded up
4403                                                  by CP to a multiple of
4404                                                  DWORD.
4405
4406                                                  Having CP load it once avoids
4407                                                  loading it at the beginning of
4408                                                  every wavefront.
4409
4410                                                  This is not used for
4411                                                  GFX7-GFX8 since it is the same
4412                                                  value as the second SGPR of
4413                                                  Flat Scratch Init. However, it
4414                                                  may be needed for GFX9-GFX10 which
4415                                                  changes the meaning of the
4416                                                  Flat Scratch Init value.
4417     then       Work-Group Id X            1      32-bit work-group id in X
4418                (enable_sgpr_workgroup_id         dimension of grid for
4419                _X)                               wavefront.
4420     then       Work-Group Id Y            1      32-bit work-group id in Y
4421                (enable_sgpr_workgroup_id         dimension of grid for
4422                _Y)                               wavefront.
4423     then       Work-Group Id Z            1      32-bit work-group id in Z
4424                (enable_sgpr_workgroup_id         dimension of grid for
4425                _Z)                               wavefront.
4426     then       Work-Group Info            1      {first_wavefront, 14'b0000,
4427                (enable_sgpr_workgroup            ordered_append_term[10:0],
4428                _info)                            threadgroup_size_in_wavefronts[5:0]}
     then       Scratch Wavefront Offset   1      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`
                _segment_wavefront_offset)        and
                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4433     ========== ========================== ====== ==============================
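
The dense numbering rule can be illustrated with a short Python sketch that
walks the set-up order in the table above and assigns SGPR numbers only to
enabled entries. The ``assign_sgprs`` helper is hypothetical, and the sketch
does not model the limit of 16 initialized User SGPRs.

```python
# (name, kernel descriptor enable field, number of SGPRs), in set-up order.
SGPR_SETUP_ORDER = [
    ("Private Segment Buffer", "enable_sgpr_private_segment_buffer", 4),
    ("Dispatch Ptr", "enable_sgpr_dispatch_ptr", 2),
    ("Queue Ptr", "enable_sgpr_queue_ptr", 2),
    ("Kernarg Segment Ptr", "enable_sgpr_kernarg_segment_ptr", 2),
    ("Dispatch Id", "enable_sgpr_dispatch_id", 2),
    ("Flat Scratch Init", "enable_sgpr_flat_scratch_init", 2),
    ("Private Segment Size", "enable_sgpr_private_segment_size", 1),
    ("Work-Group Id X", "enable_sgpr_workgroup_id_X", 1),
    ("Work-Group Id Y", "enable_sgpr_workgroup_id_Y", 1),
    ("Work-Group Id Z", "enable_sgpr_workgroup_id_Z", 1),
    ("Work-Group Info", "enable_sgpr_workgroup_info", 1),
    ("Scratch Wavefront Offset",
     "enable_sgpr_private_segment_wavefront_offset", 1),
]


def assign_sgprs(enabled_fields):
    """Map each enabled entry to its (first, last) SGPR number.

    Disabled entries get no SGPR number, so numbering stays dense from SGPR0.
    """
    layout = {}
    next_sgpr = 0
    for name, field, count in SGPR_SETUP_ORDER:
        if field in enabled_fields:
            layout[name] = (next_sgpr, next_sgpr + count - 1)
            next_sgpr += count
    return layout
```

For instance, enabling only the Private Segment Buffer, the Kernarg Segment
Ptr, and Work-Group Id X places them in SGPR0-3, SGPR4-5, and SGPR6
respectively.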
4434
4435The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4437fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4438for enabled registers are dense starting at VGPR0: the first enabled register is
4439VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4440VGPR number.
4441
4442There are different methods used for the VGPR initial state:
4443
4444* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4445  specifies otherwise, a separate VGPR register is used per work-item ID. The
4446  VGPR register initial state for this method is defined in
4447  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4449  specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4450  for all work-item IDs. The register layout for this method is defined in
4451  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4452
4453  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4454     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4455
4456     ========== ========================== ====== ==============================
4457     VGPR Order Name                       Number Description
4458                (kernel descriptor enable  of
4459                field)                     VGPRs
4460     ========== ========================== ====== ==============================
4461     First      Work-Item Id X             1      32-bit work-item id in X
4462                (Always initialized)              dimension of work-group for
4463                                                  wavefront lane.
4464     then       Work-Item Id Y             1      32-bit work-item id in Y
4465                (enable_vgpr_workitem_id          dimension of work-group for
4466                > 0)                              wavefront lane.
4467     then       Work-Item Id Z             1      32-bit work-item id in Z
4468                (enable_vgpr_workitem_id          dimension of work-group for
4469                > 1)                              wavefront lane.
4470     ========== ========================== ====== ==============================
4471
4472..
4473
4474  .. table:: Register Layout for Packed Work-Item ID Method
4475     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4476
4477     ======= ======= ================ =========================================
4478     Bits    Size    Field Name       Description
4479     ======= ======= ================ =========================================
4480     0:9     10 bits Work-Item Id X   Work-item id in X
4481                                      dimension of work-group for
4482                                      wavefront lane.
4483
4484                                      Always initialized.
4485
4486     10:19   10 bits Work-Item Id Y   Work-item id in Y
4487                                      dimension of work-group for
4488                                      wavefront lane.
4489
4490                                      Initialized if enable_vgpr_workitem_id >
4491                                      0, otherwise set to 0.
4492     20:29   10 bits Work-Item Id Z   Work-item id in Z
4493                                      dimension of work-group for
4494                                      wavefront lane.
4495
4496                                      Initialized if enable_vgpr_workitem_id >
4497                                      1, otherwise set to 0.
4498     30:31   2 bits                   Reserved, set to 0.
4499     ======= ======= ================ =========================================
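
The packed layout above can be decoded with simple shifts and masks. The
following Python sketch is an illustration only; the function names are
hypothetical and the bit positions come from the table.

```python
ID_MASK = 0x3FF  # each work-item ID field is 10 bits wide


def unpack_work_item_ids(vgpr0):
    """Split the packed VGPR0 value into (x, y, z) work-item IDs."""
    x = vgpr0 & ID_MASK           # bits 0:9
    y = (vgpr0 >> 10) & ID_MASK   # bits 10:19
    z = (vgpr0 >> 20) & ID_MASK   # bits 20:29
    return x, y, z


def pack_work_item_ids(x, y=0, z=0):
    """Inverse of unpack_work_item_ids; bits 30:31 are left as 0."""
    for value in (x, y, z):
        if not 0 <= value <= ID_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << 10) | (z << 20)
```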
4500
4501The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4502
45031. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4504   registers.
45052. Work-group Id registers X, Y, Z are set by ADC which supports any
4506   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
   its value cannot be included with the flat scratch init value, which is per
   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
45104. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4511   or (X, Y, Z).
45125. Flat Scratch register pair initialization is described in
4513   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4514
4515The global segment can be accessed either using buffer instructions (GFX6 which
4516has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
4517instructions (GFX9-GFX10).
4518
4519If buffer operations are used, then the compiler can generate a V# with the
4520following properties:
4521
4522* base address of 0
4523* no swizzle
4524* ATC: 1 if IOMMU present (such as APU)
4525* ptr64: 1
4526* MTYPE set to support memory coherence that matches the runtime (such as CC for
4527  APU and NC for dGPU).
4528
4529.. _amdgpu-amdhsa-kernel-prolog:
4530
4531Kernel Prolog
4532~~~~~~~~~~~~~
4533
4534The compiler performs initialization in the kernel prologue depending on the
4535target and information about things like stack usage in the kernel and called
4536functions. Some of this initialization requires the compiler to request certain
4537User and System SGPRs be present in the
4538:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4539:ref:`amdgpu-amdhsa-kernel-descriptor`.
4540
4541.. _amdgpu-amdhsa-kernel-prolog-cfi:
4542
4543CFI
4544+++
4545
45461.  The CFI return address is undefined.
4547
45482.  The CFI CFA is defined using an expression which evaluates to a location
4549    description that comprises one memory location description for the
4550    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4551
4552.. _amdgpu-amdhsa-kernel-prolog-m0:
4553
4554M0
4555++
4556
4557GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. M0 can instead be set to the maximum
  possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
  GFX7-GFX8).
4563GFX9-GFX10
4564  The M0 register is not used for range checking LDS accesses and so does not
4565  need to be initialized in the prolog.
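
The M0 rule above can be summarized in a small Python sketch. The function
name and the use of an integer GFX generation number are assumptions made for
illustration; the constants are the ones quoted above.

```python
def m0_prolog_value(gfx_generation, total_lds_size=None):
    """Return a value the prolog may place in M0, or None if M0 is not needed.

    gfx_generation is the GFX number (6..10). total_lds_size, if given, is the
    total LDS size taken from the dispatch packet.
    """
    if not 6 <= gfx_generation <= 10:
        raise ValueError("only GFX6-GFX10 are described here")
    if gfx_generation >= 9:
        return None  # M0 is not used for LDS range checking
    if total_lds_size is not None:
        return total_lds_size  # any value at least the total LDS size works
    # Otherwise fall back to the maximum value for the target.
    return 0x7FFF if gfx_generation == 6 else 0xFFFF
```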
4566
4567.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4568
4569Stack Pointer
4570+++++++++++++
4571
4572If the kernel has function calls it must set up the ABI stack pointer described
4573in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4574SGPR32 to the unswizzled scratch offset of the address past the last local
4575allocation.
4576
4577.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4578
4579Frame Pointer
4580+++++++++++++
4581
4582If the kernel needs a frame pointer for the reasons defined in
4583``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4584kernel prolog. If a frame pointer is not required then all uses of the frame
4585pointer are replaced with immediate ``0`` offsets.
4586
4587.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4588
4589Flat Scratch
4590++++++++++++
4591
4592There are different methods used for initializing flat scratch:
4593
4594* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4595  specifies *Does not support generic address space*:
4596
4597  Flat scratch is not supported and there is no flat scratch register pair.
4598
4599* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4600  specifies *Offset flat scratch*:
4601
4602  If the kernel or any function it calls may use flat operations to access
4603  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4604  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4605  Scratch Wavefront Offset SGPR registers (see
4606  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4607
4608  1. The low word of Flat Scratch Init is the 32-bit byte offset from
4609     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4610     being managed by SPI for the queue executing the kernel dispatch. This is
4611     the same value used in the Scratch Segment Buffer V# base address.
4612
4613     CP obtains this from the runtime. (The Scratch Segment Buffer base address
4614     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4615
4616     The prolog must add the value of Scratch Wavefront Offset to get the
4617     wavefront's byte scratch backing memory offset from
4618     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4619
4620     The Scratch Wavefront Offset must also be used as an offset with Private
4621     segment address when using the Scratch Segment Buffer.
4622
     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.
4625
4626     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4627     SGPRn is the highest numbered SGPR allocated to the wavefront).
4628     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4629     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4630     FLAT SCRATCH BASE in flat memory instructions that access the scratch
4631     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
4634
4635     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4636     checks that the value in the kernel dispatch packet Private Segment Byte
4637     Size is not larger and requests the runtime to increase the queue's scratch
4638     size if necessary.
4639
4640     CP directly loads from the kernel dispatch packet Private Segment Byte Size
4641     field and rounds up to a multiple of DWORD. Having CP load it once avoids
4642     loading it at the beginning of every wavefront.
4643
4644     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4645     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4646     in flat memory instructions.
4647
4648* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4649  specifies *Absolute flat scratch*:
4650
4651  If the kernel or any function it calls may use flat operations to access
4652  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4653  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4654  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4655  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4656
4657  The Flat Scratch Init is the 64-bit address of the base of scratch backing
4658  memory being managed by SPI for the queue executing the kernel dispatch.
4659
4660  CP obtains this from the runtime.
4661
4662  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4663  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4664  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4665  memory instructions.
4666
4667  The Scratch Wavefront Offset must also be used as an offset with Private
4668  segment address when using the Scratch Segment Buffer (see
4669  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4670
4671* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4672  specifies *Architected flat scratch*:
4673
4674  If ENABLE_PRIVATE_SEGMENT is enabled in
4675  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
4676  register pair will be initialized to the 64-bit address of the base of scratch
4677  backing memory being managed by SPI for the queue executing the kernel
4678  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4679  flat scratch base in flat memory instructions.
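
The arithmetic performed by the prolog for the *Offset flat scratch* and
*Absolute flat scratch* methods can be sketched in Python as follows. This is
an illustration under stated assumptions, not the code the backend emits: the
function names are hypothetical, and truncating the sums to 32 and 64 bits is
an assumption about register width.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
    """Offset flat scratch: return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI).

    init_lo: byte offset of the queue's scratch backing memory.
    init_hi: byte size of a single work-item's scratch usage.
    """
    # Wavefront's byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID.
    wave_byte_offset = (init_lo + scratch_wavefront_offset) & MASK32
    flat_scratch_hi = wave_byte_offset >> 8  # units of 256 bytes
    flat_scratch_lo = init_hi                # used as FLAT SCRATCH SIZE
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(flat_scratch_init, scratch_wavefront_offset):
    """Absolute flat scratch: return the 64-bit FLAT SCRATCH BASE address."""
    return (flat_scratch_init + scratch_wavefront_offset) & MASK64
```

With the *Architected flat scratch* method no such computation appears in the
prolog, since the register pair is already initialized.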
4680
4681.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4682
4683Private Segment Buffer
4684++++++++++++++++++++++
4685
4686If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4687*Architected flat scratch* then a Private Segment Buffer is not supported.
4688Instead the flat SCRATCH instructions are used.
4689
Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by the
runtime. It is used, together with Scratch Wavefront Offset as an offset, to
access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4695
4696The scratch V# is a four-aligned SGPR and always selected for the kernel as
4697follows:
4698
4699  - If it is known during instruction selection that there is stack usage,
4700    SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
4701    optimizations are disabled (``-O0``), if stack objects already exist (for
4702    locals, etc.), or if there are any function calls.
4703
4704  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4705    are reserved for the tentative scratch V#. These will be used if it is
4706    determined that spilling is needed.
4707
4708    - If no use is made of the tentative scratch V#, then it is unreserved,
4709      and the register count is determined ignoring it.
4710    - If use is made of the tentative scratch V#, then its register numbers
4711      are shifted to the first four-aligned SGPR index after the highest one
4712      allocated by the register allocator, and all uses are updated. The
4713      register count includes them in the shifted location.
4714    - In either case, if the processor has the SGPR allocation bug, the
4715      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.
4717
4718    .. note::
4719
4720      This approach of using a tentative scratch V# and shifting the register
4721      numbers if used avoids having to perform register allocation a second
4722      time if the tentative V# is eliminated. This is more efficient and
4723      avoids the problem that the second register allocation may perform
4724      spilling which will fail as there is no longer a scratch V#.
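
The shifting of a used tentative scratch V# can be illustrated with a short
Python sketch. The function name is hypothetical, and the sketch covers only
the case where the tentative V# is used and the processor does not have the
SGPR allocation bug.

```python
def shifted_scratch_vsharp(highest_allocated_sgpr):
    """Return the four SGPR numbers of the shifted scratch V#.

    The V# moves to the first four-aligned SGPR index after the highest SGPR
    allocated by the register allocator.
    """
    first = (highest_allocated_sgpr + 1 + 3) // 4 * 4  # round up to multiple of 4
    return [first, first + 1, first + 2, first + 3]


def sgpr_count_with_vsharp(highest_allocated_sgpr):
    """Register count including the V# in its shifted location."""
    return shifted_scratch_vsharp(highest_allocated_sgpr)[-1] + 1
```

For example, if the highest allocated register is SGPR5, the V# lands in
SGPR8-11 and the register count becomes 12.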
4725
4726When the kernel prolog code is being emitted it is known whether the scratch V#
4727described above is actually used. If it is, the prolog code must set it up by
4728copying the Private Segment Buffer to the scratch V# registers and then adding
4729the Private Segment Wavefront Offset to the queue base address in the V#. The
4730result is a V# with a base address pointing to the beginning of the wavefront
4731scratch backing memory.
4732
4733The Private Segment Buffer is always requested, but the Private Segment
4734Wavefront Offset is only requested if it is used (see
4735:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4736
4737.. _amdgpu-amdhsa-memory-model:
4738
4739Memory Model
4740~~~~~~~~~~~~
4741
4742This section describes the mapping of the LLVM memory model onto AMDGPU machine
4743code (see :ref:`memmodel`).
4744
4745The AMDGPU backend supports the memory synchronization scopes specified in
4746:ref:`amdgpu-memory-scopes`.
4747
4748The code sequences used to implement the memory model specify the order of
4749instructions that a single thread must execute. The ``s_waitcnt`` and cache
4750management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
4751to other memory instructions executed by the same thread. This allows them to be
4752moved earlier or later which can allow them to be combined with other instances
4753of the same instruction, or hoisted/sunk out of loops to improve performance.
4754Only the instructions related to the memory model are given; additional
4755``s_waitcnt`` instructions are required to ensure registers are defined before
4756being used. These may be able to be combined with the memory model ``s_waitcnt``
4757instructions as described above.
4758
4759The AMDGPU backend supports the following memory models:
4760
4761  HSA Memory Model [HSA]_
4762    The HSA memory model uses a single happens-before relation for all address
4763    spaces (see :ref:`amdgpu-address-spaces`).
4764  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).
4776
4777``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
4778operations.
4779
4780``buffer/global/flat_load/store/atomic`` instructions to global memory are
4781termed vector memory operations.
4782
4783Private address space uses ``buffer_load/store`` using the scratch V#
4784(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
4785is accessing the memory, atomic memory orderings are not meaningful, and all
4786accesses are treated as non-atomic.
4787
Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores. Atomic memory orderings are therefore not meaningful, and all accesses
are treated as non-atomic.
4793
4794A memory synchronization scope wider than work-group is not meaningful for the
4795group (LDS) address space and is treated as work-group.
4796
4797The memory model does not support the region address space which is treated as
4798non-atomic.
4799
4800Acquire memory ordering is not meaningful on store atomic instructions and is
4801treated as non-atomic.
4802
4803Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
4805
4806Acquire-release memory ordering is not meaningful on load or store atomic
4807instructions and is treated as acquire and release respectively.
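
The three rules above on orderings that are not meaningful for a given atomic
instruction can be summarized in a small Python sketch. The function name and
the string spellings of the operations are hypothetical; the ordering names
follow LLVM.

```python
def effective_ordering(op, ordering):
    """Return the ordering actually applied to a load or store atomic.

    op is "load" or "store"; ordering is an LLVM memory ordering name.
    """
    if op not in ("load", "store"):
        raise ValueError("only load and store atomics are covered here")
    if op == "store" and ordering == "acquire":
        return "non-atomic"
    if op == "load" and ordering == "release":
        return "non-atomic"
    if ordering == "acq_rel":
        # Treated as acquire on loads and release on stores.
        return "acquire" if op == "load" else "release"
    return ordering
```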
4808
4809The memory order also adds the single thread optimization constraints defined in
4810table
4811:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
4812
4813  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
4814     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
4815
4816     ============ ==============================================================
4817     LLVM Memory  Optimization Constraints
4818     Ordering
4819     ============ ==============================================================
4820     unordered    *none*
4821     monotonic    *none*
4822     acquire      - If a load atomic/atomicrmw then no following load/load
4823                    atomic/store/store atomic/atomicrmw/fence instruction can be
4824                    moved before the acquire.
4825                  - If a fence then same as load atomic, plus no preceding
4826                    associated fence-paired-atomic can be moved after the fence.
4827     release      - If a store atomic/atomicrmw then no preceding load/load
4828                    atomic/store/store atomic/atomicrmw/fence instruction can be
4829                    moved after the release.
4830                  - If a fence then same as store atomic, plus no following
4831                    associated fence-paired-atomic can be moved before the
4832                    fence.
4833     acq_rel      Same constraints as both acquire and release.
4834     seq_cst      - If a load atomic then same constraints as acquire, plus no
4835                    preceding sequentially consistent load atomic/store
4836                    atomic/atomicrmw/fence instruction can be moved after the
4837                    seq_cst.
4838                  - If a store atomic then the same constraints as release, plus
4839                    no following sequentially consistent load atomic/store
4840                    atomic/atomicrmw/fence instruction can be moved before the
4841                    seq_cst.
4842                  - If an atomicrmw/fence then same constraints as acq_rel.
4843     ============ ==============================================================
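
As a reading aid, the table above can be encoded as data. The sketch below
leaves out the additional fence-paired-atomic rules that apply to fences. The
function name ``reorder_constraints`` and its three-valued result are
hypothetical and chosen only for this illustration.

.. code-block:: python

  def reorder_constraints(instr, ordering):
      """Summarize the single thread constraints for one instruction.

      Returns (following, preceding): which following instructions may not
      be moved before it, and which preceding instructions may not be moved
      after it. Each entry is "none", "seq_cst" (only sequentially
      consistent atomics and fences) or "all" (any load/load atomic/store/
      store atomic/atomicrmw/fence).
      """
      if ordering in ("unordered", "monotonic"):
          return ("none", "none")
      if ordering == "acquire":
          return ("all", "none")
      if ordering == "release":
          return ("none", "all")
      if ordering == "acq_rel":
          return ("all", "all")
      if ordering == "seq_cst":
          if instr == "load atomic":
              # Acquire, plus preceding seq_cst operations stay before it.
              return ("all", "seq_cst")
          if instr == "store atomic":
              # Release, plus following seq_cst operations stay after it.
              return ("seq_cst", "all")
          # atomicrmw and fence: same constraints as acq_rel.
          return ("all", "all")
      raise ValueError("unknown memory ordering: " + ordering)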
4844
4845The code sequences used to implement the memory model are defined in the
4846following sections:
4847
4848* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
4849* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
4850* :ref:`amdgpu-amdhsa-memory-model-gfx10`
4851
4852.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
4853
4854Memory Model GFX6-GFX9
4855++++++++++++++++++++++
4856
4857For GFX6-GFX9:
4858
4859* Each agent has multiple shader arrays (SA).
4860* Each SA has multiple compute units (CU).
4861* Each CU has multiple SIMDs that execute wavefronts.
4862* The wavefronts for a single work-group are executed in the same CU but may be
4863  executed by different SIMDs.
4864* Each CU has a single LDS memory shared by the wavefronts of the work-groups
4865  executing on it.
4866* All LDS operations of a CU are performed as wavefront wide operations in a
4867  global order and involve no caching. Completion is reported to a wavefront in
4868  execution order.
4869* The LDS memory has multiple request queues shared by the SIMDs of a
4870  CU. Therefore, the LDS operations performed by different wavefronts of a
4871  work-group can be reordered relative to each other, which can result in
4872  reordering the visibility of vector memory operations with respect to LDS
4873  operations of other wavefronts in the same work-group. A ``s_waitcnt
4874  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
4875  vector memory operations between wavefronts of a work-group, but not between
4876  operations performed by the same wavefront.
4877* The vector memory operations are performed as wavefront wide operations and
4878  completion is reported to a wavefront in execution order. The exception is
4879  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
4880  vector memory order if they access LDS memory, and out of LDS operation order
4881  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
4888* The scalar memory operations access a scalar L1 cache shared by all wavefronts
4889  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
4890  scalar operations are used in a restricted way so do not impact the memory
4891  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
4892* The vector and scalar memory operations use an L2 cache shared by all CUs on
4893  the same agent.
4894* The L2 cache has independent channels to service disjoint ranges of virtual
4895  addresses.
4896* Each CU has a separate request queue per channel. Therefore, the vector and
4897  scalar memory operations performed by wavefronts executing in different
4898  work-groups (which may be executing on different CUs) of an agent can be
4899  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
4900  ensure synchronization between vector memory operations of different CUs. It
4901  ensures a previous vector memory operation has completed before executing a
4902  subsequent vector memory or LDS operation and so can be used to meet the
4903  requirements of acquire and release.
4904* The L2 cache can be kept coherent with other agents on some targets, or ranges
4905  of virtual addresses can be set up to bypass it to ensure system coherence.
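
To show how these hardware properties turn into waits and cache invalidates,
the sketch below reproduces the acquire ``load atomic`` rows of the GFX6-GFX9
code sequence table given later in this section. The function name
``acquire_load_sequence`` and the ``opencl`` flag are hypothetical, and the
returned strings use the table's notation rather than exact assembler syntax.

.. code-block:: python

  def acquire_load_sequence(scope, addr_space, opencl=False):
      """Return the GFX6-GFX9 steps for an acquire load atomic."""
      # A scope wider than work-group is treated as work-group for LDS.
      if addr_space == "local" and scope in ("agent", "system"):
          scope = "workgroup"
      load = {"global": "buffer/global_load",
              "local": "ds_load",
              "generic": "flat_load"}[addr_space]
      if scope in ("singlethread", "wavefront"):
          return [load]
      if scope == "workgroup":
          # Only LDS (local or generic) accesses need the lgkmcnt wait,
          # and OpenCL omits it.
          if addr_space == "global" or opencl:
              return [load]
          return [load, "s_waitcnt lgkmcnt(0)"]
      if scope in ("agent", "system"):
          if addr_space == "global" or opencl:
              wait = "s_waitcnt vmcnt(0)"
          else:
              # A generic access may also touch LDS.
              wait = "s_waitcnt vmcnt(0) & lgkmcnt(0)"
          # Wait for the load to complete, then invalidate the L1 cache.
          return [load + " glc=1", wait, "buffer_wbinvl1_vol"]
      raise ValueError("unknown sync scope: " + scope)

Only the agent and system scopes need ``buffer_wbinvl1_vol``, because only then
may the other wavefronts be running on a different CU with its own vector L1
cache.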
4906
4907Scalar memory operations are only used to access memory that is proven to not
4908change during the execution of the kernel dispatch. This includes constant
4909address space and global address space for program scope ``const`` variables.
4910Therefore, the kernel machine code does not have to maintain the scalar cache to
4911ensure it is coherent with the vector caches. The scalar and vector caches are
4912invalidated between kernel dispatches by CP since constant address space data
4913may change between kernel dispatch executions. See
4914:ref:`amdgpu-amdhsa-memory-spaces`.
4915
4916The one exception is if scalar writes are used to spill SGPR registers. In this
4917case the AMDGPU backend ensures the memory location used to spill is never
4918accessed by vector memory operations at the same time. If scalar writes are used
4919then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
4920return since the locations may be used for vector memory instructions by a
4921future wavefront that uses the same scratch area, or a function call that
4922creates a frame at the same address, respectively. There is no need for a
4923``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
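
The placement rule for ``s_dcache_wb`` can be pictured as a pass over an
instruction list. This is only a sketch and does not reflect how the backend
is implemented: the function name is hypothetical, and ``"<return>"`` is a
placeholder for a function return, which the text above does not name.

.. code-block:: python

  def insert_scalar_writeback(instrs, uses_scalar_writes):
      """Insert s_dcache_wb before each s_endpgm and function return."""
      # No scalar writes means there is nothing to write back.
      if not uses_scalar_writes:
          return list(instrs)
      out = []
      for ins in instrs:
          if ins in ("s_endpgm", "<return>"):
              # Write back the scalar cache before the scratch locations
              # can be reused by vector memory instructions.
              out.append("s_dcache_wb")
          out.append(ins)
      return out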
4924
4925For kernarg backing memory:
4926
4927* CP invalidates the L1 cache at the start of each kernel dispatch.
4928* On dGPU the kernarg backing memory is allocated in host memory accessed as
4929  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
4930  causes it to be treated as non-volatile and so is not invalidated by
4931  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
4934
4935Scratch backing memory (which is used for the private address space) is accessed
4936with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
4937only accessed by a single thread, and is always write-before-read, there is
4938never a need to invalidate these entries from the L1 cache. Hence all cache
4939invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
4940
4941The code sequences used to implement the memory model for GFX6-GFX9 are defined
4942in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
4943
4944  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
4945     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
4946
4947     ============ ============ ============== ========== ================================
4948     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
4949                  Ordering     Sync Scope     Address    GFX6-GFX9
4950                                              Space
4951     ============ ============ ============== ========== ================================
4952     **Non-Atomic**
4953     ------------------------------------------------------------------------------------
4954     load         *none*       *none*         - global   - !volatile & !nontemporal
4955                                              - generic
4956                                              - private    1. buffer/global/flat_load
4957                                              - constant
4958                                                         - !volatile & nontemporal
4959
4960                                                           1. buffer/global/flat_load
4961                                                              glc=1 slc=1
4962
4963                                                         - volatile
4964
4965                                                           1. buffer/global/flat_load
4966                                                              glc=1
4967                                                           2. s_waitcnt vmcnt(0)
4968
4969                                                            - Must happen before
4970                                                              any following volatile
4971                                                              global/generic
4972                                                              load/store.
4973                                                            - Ensures that
4974                                                              volatile
4975                                                              operations to
4976                                                              different
4977                                                              addresses will not
4978                                                              be reordered by
4979                                                              hardware.
4980
4981     load         *none*       *none*         - local    1. ds_load
4982     store        *none*       *none*         - global   - !volatile & !nontemporal
4983                                              - generic
4984                                              - private    1. buffer/global/flat_store
4985                                              - constant
4986                                                         - !volatile & nontemporal
4987
4988                                                           1. buffer/global/flat_store
4989                                                              glc=1 slc=1
4990
4991                                                         - volatile
4992
4993                                                           1. buffer/global/flat_store
4994                                                           2. s_waitcnt vmcnt(0)
4995
4996                                                            - Must happen before
4997                                                              any following volatile
4998                                                              global/generic
4999                                                              load/store.
5000                                                            - Ensures that
5001                                                              volatile
5002                                                              operations to
5003                                                              different
5004                                                              addresses will not
5005                                                              be reordered by
5006                                                              hardware.
5007
5008     store        *none*       *none*         - local    1. ds_store
5009     **Unordered Atomic**
5010     ------------------------------------------------------------------------------------
5011     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
5012     store atomic unordered    *any*          *any*      *Same as non-atomic*.
5013     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
5014     **Monotonic Atomic**
5015     ------------------------------------------------------------------------------------
5016     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
5017                               - wavefront    - local
5018                               - workgroup    - generic
5019     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
5020                               - system       - generic     glc=1
5021     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
5022                               - wavefront    - generic
5023                               - workgroup
5024                               - agent
5025                               - system
5026     store atomic monotonic    - singlethread - local    1. ds_store
5027                               - wavefront
5028                               - workgroup
5029     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
5030                               - wavefront    - generic
5031                               - workgroup
5032                               - agent
5033                               - system
5034     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
5035                               - wavefront
5036                               - workgroup
5037     **Acquire Atomic**
5038     ------------------------------------------------------------------------------------
5039     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
5040                               - wavefront    - local
5041                                              - generic
5042     load atomic  acquire      - workgroup    - global   1. buffer/global_load
5043     load atomic  acquire      - workgroup    - local    1. ds/flat_load
5044                                              - generic  2. s_waitcnt lgkmcnt(0)
5045
5046                                                           - If OpenCL, omit.
5047                                                           - Must happen before
5048                                                             any following
5049                                                             global/generic
5050                                                             load/load
5051                                                             atomic/store/store
5052                                                             atomic/atomicrmw.
5053                                                           - Ensures any
5054                                                             following global
5055                                                             data read is no
5056                                                             older than a local load
5057                                                             atomic value being
5058                                                             acquired.
5059
5060     load atomic  acquire      - agent        - global   1. buffer/global_load
5061                               - system                     glc=1
5062                                                         2. s_waitcnt vmcnt(0)
5063
5064                                                           - Must happen before
5065                                                             following
5066                                                             buffer_wbinvl1_vol.
5067                                                           - Ensures the load
5068                                                             has completed
5069                                                             before invalidating
5070                                                             the cache.
5071
5072                                                         3. buffer_wbinvl1_vol
5073
5074                                                           - Must happen before
5075                                                             any following
5076                                                             global/generic
5077                                                             load/load
5078                                                             atomic/atomicrmw.
5079                                                           - Ensures that
5080                                                             following
5081                                                             loads will not see
5082                                                             stale global data.
5083
5084     load atomic  acquire      - agent        - generic  1. flat_load glc=1
5085                               - system                  2. s_waitcnt vmcnt(0) &
5086                                                            lgkmcnt(0)
5087
5088                                                           - If OpenCL omit
5089                                                             lgkmcnt(0).
5090                                                           - Must happen before
5091                                                             following
5092                                                             buffer_wbinvl1_vol.
5093                                                           - Ensures the flat_load
5094                                                             has completed
5095                                                             before invalidating
5096                                                             the cache.
5097
5098                                                         3. buffer_wbinvl1_vol
5099
5100                                                           - Must happen before
5101                                                             any following
5102                                                             global/generic
5103                                                             load/load
5104                                                             atomic/atomicrmw.
5105                                                           - Ensures that
5106                                                             following loads
5107                                                             will not see stale
5108                                                             global data.
5109
5110     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5111                               - wavefront    - local
5112                                              - generic
5113     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5114     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5115                                              - generic  2. s_waitcnt lgkmcnt(0)
5116
5117                                                           - If OpenCL, omit.
5118                                                           - Must happen before
5119                                                             any following
5120                                                             global/generic
5121                                                             load/load
5122                                                             atomic/store/store
5123                                                             atomic/atomicrmw.
5124                                                           - Ensures any
5125                                                             following global
5126                                                             data read is no
5127                                                             older than a local
5128                                                             atomicrmw value
5129                                                             being acquired.
5130
5131     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5132                               - system                  2. s_waitcnt vmcnt(0)
5133
5134                                                           - Must happen before
5135                                                             following
5136                                                             buffer_wbinvl1_vol.
5137                                                           - Ensures the
5138                                                             atomicrmw has
5139                                                             completed before
5140                                                             invalidating the
5141                                                             cache.
5142
5143                                                         3. buffer_wbinvl1_vol
5144
5145                                                           - Must happen before
5146                                                             any following
5147                                                             global/generic
5148                                                             load/load
5149                                                             atomic/atomicrmw.
5150                                                           - Ensures that
5151                                                             following loads
5152                                                             will not see stale
5153                                                             global data.
5154
5155     atomicrmw    acquire      - agent        - generic  1. flat_atomic
5156                               - system                  2. s_waitcnt vmcnt(0) &
5157                                                            lgkmcnt(0)
5158
5159                                                           - If OpenCL, omit
5160                                                             lgkmcnt(0).
5161                                                           - Must happen before
5162                                                             following
5163                                                             buffer_wbinvl1_vol.
5164                                                           - Ensures the
5165                                                             atomicrmw has
5166                                                             completed before
5167                                                             invalidating the
5168                                                             cache.
5169
5170                                                         3. buffer_wbinvl1_vol
5171
5172                                                           - Must happen before
5173                                                             any following
5174                                                             global/generic
5175                                                             load/load
5176                                                             atomic/atomicrmw.
5177                                                           - Ensures that
5178                                                             following loads
5179                                                             will not see stale
5180                                                             global data.
5181
5182     fence        acquire      - singlethread *none*     *none*
5183                               - wavefront
5184     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5185
5186                                                           - If OpenCL and
5187                                                             address space is
5188                                                             not generic, omit.
5189                                                           - However, since LLVM
5190                                                             currently has no
5191                                                             address space on
5192                                                             the fence need to
5193                                                             conservatively
5194                                                             always generate. If
5195                                                             fence had an
5196                                                             address space then
5197                                                             set to address
5198                                                             space of OpenCL
5199                                                             fence flag, or to
5200                                                             generic if both
5201                                                             local and global
5202                                                             flags are
5203                                                             specified.
5204                                                           - Must happen after
5205                                                             any preceding
5206                                                             local/generic load
5207                                                             atomic/atomicrmw
5208                                                             with an equal or
5209                                                             wider sync scope
5210                                                             and memory ordering
5211                                                             stronger than
5212                                                             unordered (this is
5213                                                             termed the
5214                                                             fence-paired-atomic).
5215                                                           - Must happen before
5216                                                             any following
5217                                                             global/generic
5218                                                             load/load
5219                                                             atomic/store/store
5220                                                             atomic/atomicrmw.
5221                                                           - Ensures any
5222                                                             following global
5223                                                             data read is no
5224                                                             older than the
5225                                                             value read by the
5226                                                             fence-paired-atomic.
5227
5228     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5229                               - system                     vmcnt(0)
5230
5231                                                           - If OpenCL and
5232                                                             address space is
5233                                                             not generic, omit
5234                                                             lgkmcnt(0).
5235                                                           - However, since LLVM
5236                                                             currently has no
5237                                                             address space on
5238                                                             the fence need to
5239                                                             conservatively
5240                                                             always generate
5241                                                             (see comment for
5242                                                             previous fence).
5243                                                           - Could be split into
5244                                                             separate s_waitcnt
5245                                                             vmcnt(0) and
5246                                                             s_waitcnt
5247                                                             lgkmcnt(0) to allow
5248                                                             them to be
5249                                                             independently moved
5250                                                             according to the
5251                                                             following rules.
5252                                                           - s_waitcnt vmcnt(0)
5253                                                             must happen after
5254                                                             any preceding
5255                                                             global/generic load
5256                                                             atomic/atomicrmw
5257                                                             with an equal or
5258                                                             wider sync scope
5259                                                             and memory ordering
5260                                                             stronger than
5261                                                             unordered (this is
5262                                                             termed the
5263                                                             fence-paired-atomic).
5264                                                           - s_waitcnt lgkmcnt(0)
5265                                                             must happen after
5266                                                             any preceding
5267                                                             local/generic load
5268                                                             atomic/atomicrmw
5269                                                             with an equal or
5270                                                             wider sync scope
5271                                                             and memory ordering
5272                                                             stronger than
5273                                                             unordered (this is
5274                                                             termed the
5275                                                             fence-paired-atomic).
5276                                                           - Must happen before
5277                                                             the following
5278                                                             buffer_wbinvl1_vol.
5279                                                           - Ensures that the
5280                                                             fence-paired atomic
5281                                                             has completed
5282                                                             before invalidating
5283                                                             the
5284                                                             cache. Therefore
5285                                                             any following
5286                                                             locations read must
5287                                                             be no older than
5288                                                             the value read by
5289                                                             the
5290                                                             fence-paired-atomic.
5291
5292                                                         2. buffer_wbinvl1_vol
5293
5294                                                           - Must happen before any
5295                                                             following global/generic
5296                                                             load/load
5297                                                             atomic/store/store
5298                                                             atomic/atomicrmw.
5299                                                           - Ensures that
5300                                                             following loads
5301                                                             will not see stale
5302                                                             global data.
5303
5304     **Release Atomic**
5305     ------------------------------------------------------------------------------------
5306     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5307                               - wavefront    - local
5308                                              - generic
5309     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5310                                              - generic
5311                                                           - If OpenCL, omit.
5312                                                           - Must happen after
5313                                                             any preceding
5314                                                             local/generic
5315                                                             load/store/load
5316                                                             atomic/store
5317                                                             atomic/atomicrmw.
5318                                                           - Must happen before
5319                                                             the following
5320                                                             store.
5321                                                           - Ensures that all
5322                                                             memory operations
5323                                                             to local have
5324                                                             completed before
5325                                                             performing the
5326                                                             store that is being
5327                                                             released.
5328
5329                                                         2. buffer/global/flat_store
5330     store atomic release      - workgroup    - local    1. ds_store
5331     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5332                               - system       - generic     vmcnt(0)
5333
5334                                                           - If OpenCL and
5335                                                             address space is
5336                                                             not generic, omit
5337                                                             lgkmcnt(0).
5338                                                           - Could be split into
5339                                                             separate s_waitcnt
5340                                                             vmcnt(0) and
5341                                                             s_waitcnt
5342                                                             lgkmcnt(0) to allow
5343                                                             them to be
5344                                                             independently moved
5345                                                             according to the
5346                                                             following rules.
5347                                                           - s_waitcnt vmcnt(0)
5348                                                             must happen after
5349                                                             any preceding
5350                                                             global/generic
5351                                                             load/store/load
5352                                                             atomic/store
5353                                                             atomic/atomicrmw.
5354                                                           - s_waitcnt lgkmcnt(0)
5355                                                             must happen after
5356                                                             any preceding
5357                                                             local/generic
5358                                                             load/store/load
5359                                                             atomic/store
5360                                                             atomic/atomicrmw.
5361                                                           - Must happen before
5362                                                             the following
5363                                                             store.
5364                                                           - Ensures that all
5365                                                             memory operations
5366                                                             have
5367                                                             completed before
5368                                                             performing the
5369                                                             store that is being
5370                                                             released.
5371
5372                                                         2. buffer/global/flat_store
5373     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5374                               - wavefront    - local
5375                                              - generic
5376     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5377                                              - generic
5378                                                           - If OpenCL, omit.
5379                                                           - Must happen after
5380                                                             any preceding
5381                                                             local/generic
5382                                                             load/store/load
5383                                                             atomic/store
5384                                                             atomic/atomicrmw.
5385                                                           - Must happen before
5386                                                             the following
5387                                                             atomicrmw.
5388                                                           - Ensures that all
5389                                                             memory operations
5390                                                             to local have
5391                                                             completed before
5392                                                             performing the
5393                                                             atomicrmw that is
5394                                                             being released.
5395
5396                                                         2. buffer/global/flat_atomic
5397     atomicrmw    release      - workgroup    - local    1. ds_atomic
5398     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5399                               - system       - generic     vmcnt(0)
5400
5401                                                           - If OpenCL, omit
5402                                                             lgkmcnt(0).
5403                                                           - Could be split into
5404                                                             separate s_waitcnt
5405                                                             vmcnt(0) and
5406                                                             s_waitcnt
5407                                                             lgkmcnt(0) to allow
5408                                                             them to be
5409                                                             independently moved
5410                                                             according to the
5411                                                             following rules.
5412                                                           - s_waitcnt vmcnt(0)
5413                                                             must happen after
5414                                                             any preceding
5415                                                             global/generic
5416                                                             load/store/load
5417                                                             atomic/store
5418                                                             atomic/atomicrmw.
5419                                                           - s_waitcnt lgkmcnt(0)
5420                                                             must happen after
5421                                                             any preceding
5422                                                             local/generic
5423                                                             load/store/load
5424                                                             atomic/store
5425                                                             atomic/atomicrmw.
5426                                                           - Must happen before
5427                                                             the following
5428                                                             atomicrmw.
5429                                                           - Ensures that all
5430                                                             memory operations
5431                                                             to global and local
5432                                                             have completed
5433                                                             before performing
5434                                                             the atomicrmw that
5435                                                             is being released.
5436
5437                                                         2. buffer/global/flat_atomic
5438     fence        release      - singlethread *none*     *none*
5439                               - wavefront
5440     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5441
5442                                                           - If OpenCL and
5443                                                             address space is
5444                                                             not generic, omit.
5445                                                           - However, since LLVM
5446                                                             currently has no
5447                                                             address space on
5448                                                             the fence, the
5449                                                             s_waitcnt must
5450                                                             conservatively
5451                                                             always be generated.
5452                                                             If the fence had an
5453                                                             address space, it
5454                                                             would be set to the
5455                                                             address space of the
5456                                                             OpenCL fence flag, or
5457                                                             to generic if both
5458                                                             local and global
5459                                                             flags are specified.
5460                                                           - Must happen after
5461                                                             any preceding
5462                                                             local/generic
5463                                                             load/load
5464                                                             atomic/store/store
5465                                                             atomic/atomicrmw.
5466                                                           - Must happen before
5467                                                             any following store
5468                                                             atomic/atomicrmw
5469                                                             with an equal or
5470                                                             wider sync scope
5471                                                             and memory ordering
5472                                                             stronger than
5473                                                             unordered (this is
5474                                                             termed the
5475                                                             fence-paired-atomic).
5476                                                           - Ensures that all
5477                                                             memory operations
5478                                                             to local have
5479                                                             completed before
5480                                                             performing the
5481                                                             following
5482                                                             fence-paired-atomic.
5483
5484     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5485                               - system                     vmcnt(0)
5486
5487                                                           - If OpenCL and
5488                                                             address space is
5489                                                             not generic, omit
5490                                                             lgkmcnt(0).
5491                                                           - If OpenCL and
5492                                                             address space is
5493                                                             local, omit
5494                                                             vmcnt(0).
5495                                                           - However, since LLVM
5496                                                             currently has no
5497                                                             address space on
5498                                                             the fence, the
5499                                                             s_waitcnt must
5500                                                             conservatively
5501                                                             always be generated.
5502                                                             If the fence had an
5503                                                             address space, it
5504                                                             would be set to the
5505                                                             address space of the
5506                                                             OpenCL fence flag, or
5507                                                             to generic if both
5508                                                             local and global
5509                                                             flags are specified.
5510                                                           - Could be split into
5511                                                             separate s_waitcnt
5512                                                             vmcnt(0) and
5513                                                             s_waitcnt
5514                                                             lgkmcnt(0) to allow
5515                                                             them to be
5516                                                             independently moved
5517                                                             according to the
5518                                                             following rules.
5519                                                           - s_waitcnt vmcnt(0)
5520                                                             must happen after
5521                                                             any preceding
5522                                                             global/generic
5523                                                             load/store/load
5524                                                             atomic/store
5525                                                             atomic/atomicrmw.
5526                                                           - s_waitcnt lgkmcnt(0)
5527                                                             must happen after
5528                                                             any preceding
5529                                                             local/generic
5530                                                             load/store/load
5531                                                             atomic/store
5532                                                             atomic/atomicrmw.
5533                                                           - Must happen before
5534                                                             any following store
5535                                                             atomic/atomicrmw
5536                                                             with an equal or
5537                                                             wider sync scope
5538                                                             and memory ordering
5539                                                             stronger than
5540                                                             unordered (this is
5541                                                             termed the
5542                                                             fence-paired-atomic).
5543                                                           - Ensures that all
5544                                                             memory operations
5545                                                             have
5546                                                             completed before
5547                                                             performing the
5548                                                             following
5549                                                             fence-paired-atomic.
5550
5551     **Acquire-Release Atomic**
5552     ------------------------------------------------------------------------------------
5553     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5554                               - wavefront    - local
5555                                              - generic
5556     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5557
5558                                                           - If OpenCL, omit.
5559                                                           - Must happen after
5560                                                             any preceding
5561                                                             local/generic
5562                                                             load/store/load
5563                                                             atomic/store
5564                                                             atomic/atomicrmw.
5565                                                           - Must happen before
5566                                                             the following
5567                                                             atomicrmw.
5568                                                           - Ensures that all
5569                                                             memory operations
5570                                                             to local have
5571                                                             completed before
5572                                                             performing the
5573                                                             atomicrmw that is
5574                                                             being released.
5575
5576                                                         2. buffer/global_atomic
5577
5578     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5579                                                         2. s_waitcnt lgkmcnt(0)
5580
5581                                                           - If OpenCL, omit.
5582                                                           - Must happen before
5583                                                             any following
5584                                                             global/generic
5585                                                             load/load
5586                                                             atomic/store/store
5587                                                             atomic/atomicrmw.
5588                                                           - Ensures any
5589                                                             following global
5590                                                             data read is no
5591                                                             older than the local
5592                                                             atomicrmw value being
5593                                                             acquired.
5594
5595     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5596
5597                                                           - If OpenCL, omit.
5598                                                           - Must happen after
5599                                                             any preceding
5600                                                             local/generic
5601                                                             load/store/load
5602                                                             atomic/store
5603                                                             atomic/atomicrmw.
5604                                                           - Must happen before
5605                                                             the following
5606                                                             atomicrmw.
5607                                                           - Ensures that all
5608                                                             memory operations
5609                                                             to local have
5610                                                             completed before
5611                                                             performing the
5612                                                             atomicrmw that is
5613                                                             being released.
5614
5615                                                         2. flat_atomic
5616                                                         3. s_waitcnt lgkmcnt(0)
5617
5618                                                           - If OpenCL, omit.
5619                                                           - Must happen before
5620                                                             any following
5621                                                             global/generic
5622                                                             load/load
5623                                                             atomic/store/store
5624                                                             atomic/atomicrmw.
5625                                                           - Ensures any
5626                                                             following global
5627                                                             data read is no
5628                                                             older than a local
5629                                                             atomicrmw value being
5630                                                             acquired.
5631
5632     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5633                               - system                     vmcnt(0)
5634
5635                                                           - If OpenCL, omit
5636                                                             lgkmcnt(0).
5637                                                           - Could be split into
5638                                                             separate s_waitcnt
5639                                                             vmcnt(0) and
5640                                                             s_waitcnt
5641                                                             lgkmcnt(0) to allow
5642                                                             them to be
5643                                                             independently moved
5644                                                             according to the
5645                                                             following rules.
5646                                                           - s_waitcnt vmcnt(0)
5647                                                             must happen after
5648                                                             any preceding
5649                                                             global/generic
5650                                                             load/store/load
5651                                                             atomic/store
5652                                                             atomic/atomicrmw.
5653                                                           - s_waitcnt lgkmcnt(0)
5654                                                             must happen after
5655                                                             any preceding
5656                                                             local/generic
5657                                                             load/store/load
5658                                                             atomic/store
5659                                                             atomic/atomicrmw.
5660                                                           - Must happen before
5661                                                             the following
5662                                                             atomicrmw.
5663                                                           - Ensures that all
5664                                                             memory operations
5665                                                             to global have
5666                                                             completed before
5667                                                             performing the
5668                                                             atomicrmw that is
5669                                                             being released.
5670
5671                                                         2. buffer/global_atomic
5672                                                         3. s_waitcnt vmcnt(0)
5673
5674                                                           - Must happen before
5675                                                             the following
5676                                                             buffer_wbinvl1_vol.
5677                                                           - Ensures the
5678                                                             atomicrmw has
5679                                                             completed before
5680                                                             invalidating the
5681                                                             cache.
5682
5683                                                         4. buffer_wbinvl1_vol
5684
5685                                                           - Must happen before
5686                                                             any following
5687                                                             global/generic
5688                                                             load/load
5689                                                             atomic/atomicrmw.
5690                                                           - Ensures that
5691                                                             following loads
5692                                                             will not see stale
5693                                                             global data.
5694
5695     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
5696                               - system                     vmcnt(0)
5697
5698                                                           - If OpenCL, omit
5699                                                             lgkmcnt(0).
5700                                                           - Could be split into
5701                                                             separate s_waitcnt
5702                                                             vmcnt(0) and
5703                                                             s_waitcnt
5704                                                             lgkmcnt(0) to allow
5705                                                             them to be
5706                                                             independently moved
5707                                                             according to the
5708                                                             following rules.
5709                                                           - s_waitcnt vmcnt(0)
5710                                                             must happen after
5711                                                             any preceding
5712                                                             global/generic
5713                                                             load/store/load
5714                                                             atomic/store
5715                                                             atomic/atomicrmw.
5716                                                           - s_waitcnt lgkmcnt(0)
5717                                                             must happen after
5718                                                             any preceding
5719                                                             local/generic
5720                                                             load/store/load
5721                                                             atomic/store
5722                                                             atomic/atomicrmw.
5723                                                           - Must happen before
5724                                                             the following
5725                                                             atomicrmw.
5726                                                           - Ensures that all
5727                                                             memory operations
5728                                                             to global have
5729                                                             completed before
5730                                                             performing the
5731                                                             atomicrmw that is
5732                                                             being released.
5733
5734                                                         2. flat_atomic
5735                                                         3. s_waitcnt vmcnt(0) &
5736                                                            lgkmcnt(0)
5737
5738                                                           - If OpenCL, omit
5739                                                             lgkmcnt(0).
5740                                                           - Must happen before
5741                                                             following
5742                                                             buffer_wbinvl1_vol.
5743                                                           - Ensures the
5744                                                             atomicrmw has
5745                                                             completed before
5746                                                             invalidating the
5747                                                             cache.
5748
5749                                                         4. buffer_wbinvl1_vol
5750
5751                                                           - Must happen before
5752                                                             any following
5753                                                             global/generic
5754                                                             load/load
5755                                                             atomic/atomicrmw.
5756                                                           - Ensures that
5757                                                             following loads
5758                                                             will not see stale
5759                                                             global data.
5760
5761     fence        acq_rel      - singlethread *none*     *none*
5762                               - wavefront
5763     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5764
5765                                                           - If OpenCL and
5766                                                             address space is
5767                                                             not generic, omit.
5768                                                           - However,
5769                                                             since LLVM
5770                                                             currently has no
5771                                                             address space on
5772                                                             the fence, it must
5773                                                             conservatively
5774                                                             always be generated
5775                                                             (see comment for
5776                                                             previous fence).
5777                                                           - Must happen after
5778                                                             any preceding
5779                                                             local/generic
5780                                                             load/load
5781                                                             atomic/store/store
5782                                                             atomic/atomicrmw.
5783                                                           - Must happen before
5784                                                             any following
5785                                                             global/generic
5786                                                             load/load
5787                                                             atomic/store/store
5788                                                             atomic/atomicrmw.
5789                                                           - Ensures that all
5790                                                             memory operations
5791                                                             to local have
5792                                                             completed before
5793                                                             performing any
5794                                                             following global
5795                                                             memory operations.
5796                                                           - Ensures that the
5797                                                             preceding
5798                                                             local/generic load
5799                                                             atomic/atomicrmw
5800                                                             with an equal or
5801                                                             wider sync scope
5802                                                             and memory ordering
5803                                                             stronger than
5804                                                             unordered (this is
5805                                                             termed the
5806                                                             acquire-fence-paired-atomic)
5807                                                             has completed
5808                                                             before following
5809                                                             global memory
5810                                                             operations. This
5811                                                             satisfies the
5812                                                             requirements of
5813                                                             acquire.
5814                                                           - Ensures that all
5815                                                             previous memory
5816                                                             operations have
5817                                                             completed before a
5818                                                             following
5819                                                             local/generic store
5820                                                             atomic/atomicrmw
5821                                                             with an equal or
5822                                                             wider sync scope
5823                                                             and memory ordering
5824                                                             stronger than
5825                                                             unordered (this is
5826                                                             termed the
5827                                                             release-fence-paired-atomic).
5828                                                             This satisfies the
5829                                                             requirements of
5830                                                             release.
5831
5832     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5833                               - system                     vmcnt(0)
5834
5835                                                           - If OpenCL and
5836                                                             address space is
5837                                                             not generic, omit
5838                                                             lgkmcnt(0).
5839                                                           - However, since LLVM
5840                                                             currently has no
5841                                                             address space on
5842                                                             the fence, it must
5843                                                             conservatively
5844                                                             always be generated
5845                                                             (see comment for
5846                                                             previous fence).
5847                                                           - Could be split into
5848                                                             separate s_waitcnt
5849                                                             vmcnt(0) and
5850                                                             s_waitcnt
5851                                                             lgkmcnt(0) to allow
5852                                                             them to be
5853                                                             independently moved
5854                                                             according to the
5855                                                             following rules.
5856                                                           - s_waitcnt vmcnt(0)
5857                                                             must happen after
5858                                                             any preceding
5859                                                             global/generic
5860                                                             load/store/load
5861                                                             atomic/store
5862                                                             atomic/atomicrmw.
5863                                                           - s_waitcnt lgkmcnt(0)
5864                                                             must happen after
5865                                                             any preceding
5866                                                             local/generic
5867                                                             load/store/load
5868                                                             atomic/store
5869                                                             atomic/atomicrmw.
5870                                                           - Must happen before
5871                                                             the following
5872                                                             buffer_wbinvl1_vol.
5873                                                           - Ensures that the
5874                                                             preceding
5875                                                             global/local/generic
5876                                                             load
5877                                                             atomic/atomicrmw
5878                                                             with an equal or
5879                                                             wider sync scope
5880                                                             and memory ordering
5881                                                             stronger than
5882                                                             unordered (this is
5883                                                             termed the
5884                                                             acquire-fence-paired-atomic)
5885                                                             has completed
5886                                                             before invalidating
5887                                                             the cache. This
5888                                                             satisfies the
5889                                                             requirements of
5890                                                             acquire.
5891                                                           - Ensures that all
5892                                                             previous memory
5893                                                             operations have
5894                                                             completed before a
5895                                                             following
5896                                                             global/local/generic
5897                                                             store
5898                                                             atomic/atomicrmw
5899                                                             with an equal or
5900                                                             wider sync scope
5901                                                             and memory ordering
5902                                                             stronger than
5903                                                             unordered (this is
5904                                                             termed the
5905                                                             release-fence-paired-atomic).
5906                                                             This satisfies the
5907                                                             requirements of
5908                                                             release.
5909
5910                                                         2. buffer_wbinvl1_vol
5911
5912                                                           - Must happen before
5913                                                             any following
5914                                                             global/generic
5915                                                             load/load
5916                                                             atomic/store/store
5917                                                             atomic/atomicrmw.
5918                                                           - Ensures that
5919                                                             following loads
5920                                                             will not see stale
5921                                                             global data. This
5922                                                             satisfies the
5923                                                             requirements of
5924                                                             acquire.
5925
5926     **Sequentially Consistent Atomic**
5927     ------------------------------------------------------------------------------------
5928     load atomic  seq_cst      - singlethread - global   *Same as corresponding
5929                               - wavefront    - local    load atomic acquire,
5930                                              - generic  except must generate
5931                                                         all instructions even
5932                                                         for OpenCL.*
5933     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5934                                              - generic
5935
5936                                                           - Must
5937                                                             happen after
5938                                                             preceding
5939                                                             local/generic load
5940                                                             atomic/store
5941                                                             atomic/atomicrmw
5942                                                             with memory
5943                                                             ordering of seq_cst
5944                                                             and with equal or
5945                                                             wider sync scope.
5946                                                             (Note that seq_cst
5947                                                             fences have their
5948                                                             own s_waitcnt
5949                                                             lgkmcnt(0) and so do
5950                                                             not need to be
5951                                                             considered.)
5952                                                           - Ensures any
5953                                                             preceding
5954                                                             sequentially
5955                                                             consistent local
5956                                                             memory instructions
5957                                                             have completed
5958                                                             before executing
5959                                                             this sequentially
5960                                                             consistent
5961                                                             instruction. This
5962                                                             prevents reordering
5963                                                             a seq_cst store
5964                                                             followed by a
5965                                                             seq_cst load. (Note
5966                                                             that seq_cst is
5967                                                             stronger than
5968                                                             acquire/release as
5969                                                             the reordering of
5970                                                             load acquire
5971                                                             followed by a store
5972                                                             release is
5973                                                             prevented by the
5974                                                             s_waitcnt of
5975                                                             the release, but
5976                                                             there is nothing
5977                                                             preventing a store
5978                                                             release followed by
5979                                                             load acquire from
5980                                                             completing out of
5981                                                             order. The s_waitcnt
5982                                                             could be placed after
5983                                                             the seq_cst store or
5984                                                             before the seq_cst load. We
5985                                                             choose the load to
5986                                                             make the s_waitcnt be
5987                                                             as late as possible
5988                                                             so that the store
5989                                                             may have already
5990                                                             completed.)
5991
5992                                                         2. *Following
5993                                                            instructions same as
5994                                                            corresponding load
5995                                                            atomic acquire,
5996                                                            except must generate
5997                                                            all instructions even
5998                                                            for OpenCL.*
5999     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6000                                                         load atomic acquire,
6001                                                         except must generate
6002                                                         all instructions even
6003                                                         for OpenCL.*
6004
6005     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6006                               - system       - generic     vmcnt(0)
6007
6008                                                           - Could be split into
6009                                                             separate s_waitcnt
6010                                                             vmcnt(0)
6011                                                             and s_waitcnt
6012                                                             lgkmcnt(0) to allow
6013                                                             them to be
6014                                                             independently moved
6015                                                             according to the
6016                                                             following rules.
6017                                                           - s_waitcnt lgkmcnt(0)
6018                                                             must happen after
6019                                                             preceding
6020                                                             local/generic load
6021                                                             atomic/store
6022                                                             atomic/atomicrmw
6023                                                             with memory
6024                                                             ordering of seq_cst
6025                                                             and with equal or
6026                                                             wider sync scope.
6027                                                             (Note that seq_cst
6028                                                             fences have their
6029                                                             own s_waitcnt
6030                                                             lgkmcnt(0) and so do
6031                                                             not need to be
6032                                                             considered.)
6033                                                           - s_waitcnt vmcnt(0)
6034                                                             must happen after
6035                                                             preceding
6036                                                             global/generic load
6037                                                             atomic/store
6038                                                             atomic/atomicrmw
6039                                                             with memory
6040                                                             ordering of seq_cst
6041                                                             and with equal or
6042                                                             wider sync scope.
6043                                                             (Note that seq_cst
6044                                                             fences have their
6045                                                             own s_waitcnt
6046                                                             vmcnt(0) and so do
6047                                                             not need to be
6048                                                             considered.)
6049                                                           - Ensures any
6050                                                             preceding
6051                                                             sequentially
6052                                                             consistent global
6053                                                             memory instructions
6054                                                             have completed
6055                                                             before executing
6056                                                             this sequentially
6057                                                             consistent
6058                                                             instruction. This
6059                                                             prevents reordering
6060                                                             a seq_cst store
6061                                                             followed by a
6062                                                             seq_cst load. (Note
6063                                                             that seq_cst is
6064                                                             stronger than
6065                                                             acquire/release as
6066                                                             the reordering of
6067                                                             load acquire
6068                                                             followed by a store
6069                                                             release is
6070                                                             prevented by the
6071                                                             s_waitcnt of
6072                                                             the release, but
6073                                                             there is nothing
6074                                                             preventing a store
6075                                                             release followed by
6076                                                             load acquire from
6077                                                             completing out of
6078                                                             order. The s_waitcnt
6079                                                             could be placed after
6080                                                             the seq_cst store or
6081                                                             before the seq_cst load. We
6082                                                             choose the load to
6083                                                             make the s_waitcnt be
6084                                                             as late as possible
6085                                                             so that the store
6086                                                             may have already
6087                                                             completed.)
6088
6089                                                         2. *Following
6090                                                            instructions same as
6091                                                            corresponding load
6092                                                            atomic acquire,
6093                                                            except must generate
6094                                                            all instructions even
6095                                                            for OpenCL.*
6096     store atomic seq_cst      - singlethread - global   *Same as corresponding
6097                               - wavefront    - local    store atomic release,
6098                               - workgroup    - generic  except must generate
6099                               - agent                   all instructions even
6100                               - system                  for OpenCL.*
6101     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6102                               - wavefront    - local    atomicrmw acq_rel,
6103                               - workgroup    - generic  except must generate
6104                               - agent                   all instructions even
6105                               - system                  for OpenCL.*
6106     fence        seq_cst      - singlethread *none*     *Same as corresponding
6107                               - wavefront               fence acq_rel,
6108                               - workgroup               except must generate
6109                               - agent                   all instructions even
6110                               - system                  for OpenCL.*
6111     ============ ============ ============== ========== ================================
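
The ``fence acq_rel`` and ``load atomic seq_cst`` rows of the table above can be
read as a lookup from sync scope (and address space) to a code sequence. The
following is a minimal Python sketch of that lookup, not the way LLVM implements
it. The constant and function names are hypothetical, and only the conservative
sequences are modeled (nothing is omitted for OpenCL).

.. code-block:: python

   # Illustrative only: these names are hypothetical and are not part of LLVM.
   # The sequences are the conservative ones listed in the table above.
   WAIT_LDS = "s_waitcnt lgkmcnt(0)"
   WAIT_LDS_AND_VMEM = "s_waitcnt lgkmcnt(0) & vmcnt(0)"
   INVALIDATE_L1 = "buffer_wbinvl1_vol"

   SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")
   ADDRESS_SPACES = ("global", "local", "generic")


   def fence_acq_rel(scope):
       """Return the code sequence the table lists for ``fence acq_rel``."""
       if scope not in SCOPES:
           raise ValueError(f"unknown sync scope: {scope}")
       if scope in ("singlethread", "wavefront"):
           return []  # *none*
       if scope == "workgroup":
           return [WAIT_LDS]
       # agent and system: wait for outstanding operations, then invalidate.
       return [WAIT_LDS_AND_VMEM, INVALIDATE_L1]


   def load_seq_cst_prefix(scope, address_space):
       """Return the wait emitted before the load atomic acquire sequence."""
       if scope not in SCOPES:
           raise ValueError(f"unknown sync scope: {scope}")
       if address_space not in ADDRESS_SPACES:
           raise ValueError(f"unknown address space: {address_space}")
       if scope in ("singlethread", "wavefront"):
           return []  # same as load atomic acquire
       if scope == "workgroup":
           return [] if address_space == "local" else [WAIT_LDS]
       if address_space == "local":
           raise ValueError("the table has no agent/system row for local")
       return [WAIT_LDS_AND_VMEM]

For example, ``fence_acq_rel("agent")`` returns the combined wait followed by
the cache invalidate, matching the two numbered steps of the agent/system row.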
6112
6113.. _amdgpu-amdhsa-memory-model-gfx90a:
6114
6115Memory Model GFX90A
6116+++++++++++++++++++
6117
6118For GFX90A:
6119
6120* Each agent has multiple shader arrays (SA).
6121* Each SA has multiple compute units (CU).
6122* Each CU has multiple SIMDs that execute wavefronts.
6123* The wavefronts for a single work-group are executed in the same CU but may be
6124  executed by different SIMDs. The exception is tgsplit execution mode, in
6125  which the wavefronts may be executed by different SIMDs in different CUs.
6126* Each CU has a single LDS memory shared by the wavefronts of the work-groups
6127  executing on it. The exception is tgsplit execution mode, in which no LDS is
6128  allocated because wavefronts of the same work-group can be in different CUs.
6129* All LDS operations of a CU are performed as wavefront wide operations in a
6130  global order and involve no caching. Completion is reported to a wavefront in
6131  execution order.
6132* The LDS memory has multiple request queues shared by the SIMDs of a
6133  CU. Therefore, the LDS operations performed by different wavefronts of a
6134  work-group can be reordered relative to each other, which can result in
6135  reordering the visibility of vector memory operations with respect to LDS
6136  operations of other wavefronts in the same work-group. A ``s_waitcnt
6137  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6138  vector memory operations between wavefronts of a work-group, but not between
6139  operations performed by the same wavefront.
6140* The vector memory operations are performed as wavefront wide operations and
6141  completion is reported to a wavefront in execution order. The exception is
6142  that ``flat_load/store/atomic`` instructions can report out of vector memory
6143  order if they access LDS memory, and out of LDS operation order if they access
6144  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:
6147
6148  * No special action is required for coherence between the lanes of a single
6149    wavefront.
6150
6151  * No special action is required for coherence between wavefronts in the same
6152    work-group since they execute on the same CU. The exception is when in
6153    tgsplit execution mode as wavefronts of the same work-group can be in
6154    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6155    the following item.
6156
6157  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6158    executing in different work-groups as they may be executing on different
6159    CUs.
6160
6161* The scalar memory operations access a scalar L1 cache shared by all wavefronts
6162  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6163  scalar operations are used in a restricted way so do not impact the memory
6164  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6165* The vector and scalar memory operations use an L2 cache shared by all CUs on
6166  the same agent.
6167
6168  * The L2 cache has independent channels to service disjoint ranges of virtual
6169    addresses.
6170  * Each CU has a separate request queue per channel. Therefore, the vector and
6171    scalar memory operations performed by wavefronts executing in different
6172    work-groups (which may be executing on different CUs), or the same
6173    work-group if executing in tgsplit mode, of an agent can be reordered
6174    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6175    synchronization between vector memory operations of different CUs. It
6176    ensures a previous vector memory operation has completed before executing a
6177    subsequent vector memory or LDS operation and so can be used to meet the
6178    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by using
    MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE C-bit for
    memory local to the L2, and MTYPE NC (non-coherent) with the PTE C-bit set
    or MTYPE UC (uncached) for memory not local to the L2.
6183
6184    * Any local memory cache lines will be automatically invalidated by writes
6185      from CUs associated with other L2 caches, or writes from the CPU, due to
6186      the cache probe caused by coherent requests. Coherent requests are caused
6187      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6188      XGMI, and by PCIe requests that are configured to be coherent requests.
6189    * XGMI accesses from the CPU to local memory may be cached on the CPU.
6190      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6192    * Since all work-groups on the same agent share the same L2, no L2
6193      invalidation or writeback is required for coherence.
6194    * To ensure coherence of local and remote memory writes of work-groups in
6195      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6196      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
      (used for remote coarse grain memory). Note that MTYPE CC (used for local
6198      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6199      remote fine grain memory) bypasses the L2, so both will never result in
6200      dirty L2 cache lines.
6201    * To ensure coherence of local and remote memory reads of work-groups in
6202      different agents a ``buffer_invl2`` is required. It will invalidate L2
6203      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
      coarse grain memory) cause local reads to be invalidated by remote writes
      with the PTE C-bit so these cache lines are not invalidated. Note that
6207      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6208      never result in L2 cache lines that need to be invalidated.
6209
6210  * PCIe access from the GPU to the CPU memory is kept coherent by using the
6211    MTYPE UC (uncached) which bypasses the L2.
6212
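The cache maintenance rules in the list above can be summarized as a small
lookup. The following Python sketch is only an illustration of those rules and
is not part of LLVM; the names ``MType``, ``WRITTEN_BACK_BY_WBL2``,
``INVALIDATED_BY_INVL2`` and ``needs_l1_invalidate`` are hypothetical.

```python
from enum import Enum


class MType(Enum):
    """Memory types named in the list above, with their stated use."""
    RW = "local coarse grain memory"
    CC = "local fine grain memory"
    NC = "remote coarse grain memory"
    UC = "remote fine grain memory"


# MTYPEs whose dirty L2 lines are written back by buffer_wbl2.
# CC writes through and UC bypasses the L2, so neither leaves dirty lines.
WRITTEN_BACK_BY_WBL2 = {MType.RW, MType.NC}

# MTYPEs whose L2 lines are invalidated by buffer_invl2.
# RW and CC lines are invalidated by probes; UC bypasses the L2.
INVALIDATED_BY_INVL2 = {MType.NC}


def needs_l1_invalidate(same_work_group: bool, tgsplit: bool) -> bool:
    """Return True if buffer_wbinvl1_vol is needed for vector L1 coherence
    between two wavefronts.

    Wavefronts of one work-group share a CU (and so an L1) unless the
    work-group runs in tgsplit execution mode.
    """
    return (not same_work_group) or tgsplit
```

For example, ``needs_l1_invalidate(same_work_group=True, tgsplit=False)``
evaluates to ``False``, matching the case where no special action is required.
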
6213Scalar memory operations are only used to access memory that is proven to not
6214change during the execution of the kernel dispatch. This includes constant
6215address space and global address space for program scope ``const`` variables.
6216Therefore, the kernel machine code does not have to maintain the scalar cache to
6217ensure it is coherent with the vector caches. The scalar and vector caches are
6218invalidated between kernel dispatches by CP since constant address space data
6219may change between kernel dispatch executions. See
6220:ref:`amdgpu-amdhsa-memory-spaces`.
6221
6222The one exception is if scalar writes are used to spill SGPR registers. In this
6223case the AMDGPU backend ensures the memory location used to spill is never
6224accessed by vector memory operations at the same time. If scalar writes are used
6225then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6226return since the locations may be used for vector memory instructions by a
6227future wavefront that uses the same scratch area, or a function call that
6228creates a frame at the same address, respectively. There is no need for a
6229``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6230
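The placement rule for ``s_dcache_wb`` described above can be sketched as a
pass over a list of mnemonics. This is a simplified Python illustration, not
the backend's implementation: ``insert_scalar_writebacks`` is a hypothetical
helper, instructions are assumed to be plain strings, and ``s_setpc_b64`` is
used here only as an assumed stand-in for a function return.

```python
from typing import List

END_OF_PROGRAM = "s_endpgm"
FUNCTION_RETURN = "s_setpc_b64"  # assumption: stand-in for a function return


def insert_scalar_writebacks(instrs: List[str],
                             uses_scalar_stores: bool) -> List[str]:
    """Insert s_dcache_wb before each program end or function return.

    Nothing is inserted when no scalar writes are used, and no
    s_dcache_inv is ever added (scalar writes are write-before-read).
    """
    if not uses_scalar_stores:
        return list(instrs)
    out: List[str] = []
    for ins in instrs:
        # The mnemonic is the first whitespace-separated token, if any.
        mnemonic = ins.split()[0] if ins.strip() else ""
        if mnemonic in (END_OF_PROGRAM, FUNCTION_RETURN):
            out.append("s_dcache_wb")
        out.append(ins)
    return out
```

Given ``["s_store_dword", "s_endpgm"]`` (an illustrative input) with
``uses_scalar_stores=True``, the sketch places ``s_dcache_wb`` directly before
``s_endpgm``.
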
6231For kernarg backing memory:
6232
6233* CP invalidates the L1 cache at the start of each kernel dispatch.
6234* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6235  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6236  cache. This also causes it to be treated as non-volatile and so is not
6237  invalidated by ``*_vol``.
6238* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6239  so the L2 cache will be coherent with the CPU and other agents.
6240
6241Scratch backing memory (which is used for the private address space) is accessed
6242with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6243only accessed by a single thread, and is always write-before-read, there is
6244never a need to invalidate these entries from the L1 cache. Hence all cache
6245invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6246
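The effect of a ``*_vol`` invalidate can be pictured with a toy cache model.
The Python sketch below is purely illustrative and does not describe the
hardware: ``ToyL1Cache`` is a hypothetical class, and treating global memory
lines as volatile is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ToyL1Cache:
    # Maps address -> (value, is_volatile).
    lines: Dict[int, Tuple[int, bool]] = field(default_factory=dict)

    def fill(self, addr: int, value: int, volatile: bool) -> None:
        """Record a cache line and whether it is a volatile line."""
        self.lines[addr] = (value, volatile)

    def invalidate_vol(self) -> None:
        """Model a ``*_vol`` invalidate: drop only the volatile lines."""
        self.lines = {addr: line for addr, line in self.lines.items()
                      if not line[1]}


l1 = ToyL1Cache()
l1.fill(0x1000, 7, volatile=False)  # scratch line (MTYPE NC_NV)
l1.fill(0x2000, 9, volatile=True)   # assumed volatile global line
l1.invalidate_vol()
```

After ``invalidate_vol()`` only the scratch line at ``0x1000`` remains, which
mirrors why scratch entries never need to be invalidated.
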
6247The code sequences used to implement the memory model for GFX90A are defined
6248in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6249
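As a reading aid, the ``load atomic acquire`` rows for the global address space
in that table can be transcribed as a small function. This Python sketch is
only an illustration of how the rows are read; ``load_acquire_global`` is a
hypothetical helper that is not part of LLVM, and the returned strings are the
table's shorthand rather than assembler syntax.

```python
from typing import List


def load_acquire_global(scope: str, tgsplit: bool = False) -> List[str]:
    """Return the GFX90A sequence for ``load atomic acquire`` on global memory."""
    if scope in ("singlethread", "wavefront"):
        return ["buffer/global/ds/flat_load"]
    if scope == "workgroup":
        if not tgsplit:
            # Outside tgsplit mode glc=1, the wait and the invalidate are omitted.
            return ["buffer/global_load"]
        return ["buffer/global_load glc=1",
                "s_waitcnt vmcnt(0)",
                "buffer_wbinvl1_vol"]
    if scope == "agent":
        return ["buffer/global_load glc=1",
                "s_waitcnt vmcnt(0)",
                "buffer_wbinvl1_vol"]
    if scope == "system":
        # System scope additionally invalidates stale MTYPE NC lines in the L2.
        return ["buffer/global/flat_load glc=1",
                "s_waitcnt vmcnt(0)",
                "buffer_invl2",
                "buffer_wbinvl1_vol"]
    raise ValueError(f"unknown sync scope: {scope}")
```

In every case that includes it, the ``s_waitcnt vmcnt(0)`` precedes the cache
invalidates so that the load has completed before the caches are invalidated.
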
6250  .. table:: AMDHSA Memory Model Code Sequences GFX90A
6251     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6252
6253     ============ ============ ============== ========== ================================
6254     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6255                  Ordering     Sync Scope     Address    GFX90A
6256                                              Space
6257     ============ ============ ============== ========== ================================
6258     **Non-Atomic**
6259     ------------------------------------------------------------------------------------
6260     load         *none*       *none*         - global   - !volatile & !nontemporal
6261                                              - generic
6262                                              - private    1. buffer/global/flat_load
6263                                              - constant
6264                                                         - !volatile & nontemporal
6265
6266                                                           1. buffer/global/flat_load
6267                                                              glc=1 slc=1
6268
6269                                                         - volatile
6270
6271                                                           1. buffer/global/flat_load
6272                                                              glc=1
6273                                                           2. s_waitcnt vmcnt(0)
6274
6275                                                            - Must happen before
6276                                                              any following volatile
6277                                                              global/generic
6278                                                              load/store.
6279                                                            - Ensures that
6280                                                              volatile
6281                                                              operations to
6282                                                              different
6283                                                              addresses will not
6284                                                              be reordered by
6285                                                              hardware.
6286
6287     load         *none*       *none*         - local    1. ds_load
6288     store        *none*       *none*         - global   - !volatile & !nontemporal
6289                                              - generic
6290                                              - private    1. buffer/global/flat_store
6291                                              - constant
6292                                                         - !volatile & nontemporal
6293
6294                                                           1. buffer/global/flat_store
6295                                                              glc=1 slc=1
6296
6297                                                         - volatile
6298
6299                                                           1. buffer/global/flat_store
6300                                                           2. s_waitcnt vmcnt(0)
6301
6302                                                            - Must happen before
6303                                                              any following volatile
6304                                                              global/generic
6305                                                              load/store.
6306                                                            - Ensures that
6307                                                              volatile
6308                                                              operations to
6309                                                              different
6310                                                              addresses will not
6311                                                              be reordered by
6312                                                              hardware.
6313
6314     store        *none*       *none*         - local    1. ds_store
6315     **Unordered Atomic**
6316     ------------------------------------------------------------------------------------
6317     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6318     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6319     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6320     **Monotonic Atomic**
6321     ------------------------------------------------------------------------------------
6322     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6323                               - wavefront    - generic
6324     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6325                                              - generic     glc=1
6326
6327                                                           - If not TgSplit execution
6328                                                             mode, omit glc=1.
6329
6330     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6331                               - wavefront               local address space cannot
6332                               - workgroup               be used.*
6333
6334                                                         1. ds_load
6335     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6336                                              - generic     glc=1
6337     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6338                                              - generic     glc=1
6339     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6340                               - wavefront    - generic
6341                               - workgroup
6342                               - agent
6343     store atomic monotonic    - system       - global   1. buffer/global/flat_store
6344                                              - generic
6345     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6346                               - wavefront               local address space cannot
6347                               - workgroup               be used.*
6348
6349                                                         1. ds_store
6350     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6351                               - wavefront    - generic
6352                               - workgroup
6353                               - agent
6354     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6355                                              - generic
6356     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6357                               - wavefront               local address space cannot
6358                               - workgroup               be used.*
6359
6360                                                         1. ds_atomic
6361     **Acquire Atomic**
6362     ------------------------------------------------------------------------------------
6363     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6364                               - wavefront    - local
6365                                              - generic
6366     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6367
6368                                                           - If not TgSplit execution
6369                                                             mode, omit glc=1.
6370
6371                                                         2. s_waitcnt vmcnt(0)
6372
6373                                                           - If not TgSplit execution
6374                                                             mode, omit.
6375                                                           - Must happen before the
6376                                                             following buffer_wbinvl1_vol.
6377
6378                                                         3. buffer_wbinvl1_vol
6379
6380                                                           - If not TgSplit execution
6381                                                             mode, omit.
6382                                                           - Must happen before
6383                                                             any following
6384                                                             global/generic
6385                                                             load/load
6386                                                             atomic/store/store
6387                                                             atomic/atomicrmw.
6388                                                           - Ensures that
6389                                                             following
6390                                                             loads will not see
6391                                                             stale data.
6392
6393     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6394                                                         local address space cannot
6395                                                         be used.*
6396
6397                                                         1. ds_load
6398                                                         2. s_waitcnt lgkmcnt(0)
6399
6400                                                           - If OpenCL, omit.
6401                                                           - Must happen before
6402                                                             any following
6403                                                             global/generic
6404                                                             load/load
6405                                                             atomic/store/store
6406                                                             atomic/atomicrmw.
6407                                                           - Ensures any
6408                                                             following global
6409                                                             data read is no
6410                                                             older than the local load
6411                                                             atomic value being
6412                                                             acquired.
6413
6414     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6415
6416                                                           - If not TgSplit execution
6417                                                             mode, omit glc=1.
6418
6419                                                         2. s_waitcnt lgkm/vmcnt(0)
6420
6421                                                           - Use lgkmcnt(0) if not
6422                                                             TgSplit execution mode
6423                                                             and vmcnt(0) if TgSplit
6424                                                             execution mode.
6425                                                           - If OpenCL, omit lgkmcnt(0).
6426                                                           - Must happen before
6427                                                             the following
6428                                                             buffer_wbinvl1_vol and any
6429                                                             following global/generic
6430                                                             load/load
6431                                                             atomic/store/store
6432                                                             atomic/atomicrmw.
6433                                                           - Ensures any
6434                                                             following global
6435                                                             data read is no
6436                                                             older than a local load
6437                                                             atomic value being
6438                                                             acquired.
6439
6440                                                         3. buffer_wbinvl1_vol
6441
6442                                                           - If not TgSplit execution
6443                                                             mode, omit.
6444                                                           - Ensures that
6445                                                             following
6446                                                             loads will not see
6447                                                             stale data.
6448
6449     load atomic  acquire      - agent        - global   1. buffer/global_load
6450                                                            glc=1
6451                                                         2. s_waitcnt vmcnt(0)
6452
6453                                                           - Must happen before
6454                                                             following
6455                                                             buffer_wbinvl1_vol.
6456                                                           - Ensures the load
6457                                                             has completed
6458                                                             before invalidating
6459                                                             the cache.
6460
6461                                                         3. buffer_wbinvl1_vol
6462
6463                                                           - Must happen before
6464                                                             any following
6465                                                             global/generic
6466                                                             load/load
6467                                                             atomic/atomicrmw.
6468                                                           - Ensures that
6469                                                             following
6470                                                             loads will not see
6471                                                             stale global data.
6472
6473     load atomic  acquire      - system       - global   1. buffer/global/flat_load
6474                                                            glc=1
6475                                                         2. s_waitcnt vmcnt(0)
6476
6477                                                           - Must happen before
6478                                                             following buffer_invl2 and
6479                                                             buffer_wbinvl1_vol.
6480                                                           - Ensures the load
6481                                                             has completed
6482                                                             before invalidating
6483                                                             the cache.
6484
6485                                                         3. buffer_invl2;
6486                                                            buffer_wbinvl1_vol
6487
6488                                                           - Must happen before
6489                                                             any following
6490                                                             global/generic
6491                                                             load/load
6492                                                             atomic/atomicrmw.
6493                                                           - Ensures that
6494                                                             following
6495                                                             loads will not see
6496                                                             stale L1 global data,
6497                                                             nor see stale L2 MTYPE
6498                                                             NC global data.
6499                                                             MTYPE RW and CC memory will
6500                                                             never be stale in L2 due to
6501                                                             the memory probes.
6502
6503     load atomic  acquire      - agent        - generic  1. flat_load glc=1
6504                                                         2. s_waitcnt vmcnt(0) &
6505                                                            lgkmcnt(0)
6506
6507                                                           - If TgSplit execution mode,
6508                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
6511                                                           - Must happen before
6512                                                             following
6513                                                             buffer_wbinvl1_vol.
6514                                                           - Ensures the flat_load
6515                                                             has completed
6516                                                             before invalidating
6517                                                             the cache.
6518
6519                                                         3. buffer_wbinvl1_vol
6520
6521                                                           - Must happen before
6522                                                             any following
6523                                                             global/generic
6524                                                             load/load
6525                                                             atomic/atomicrmw.
6526                                                           - Ensures that
6527                                                             following loads
6528                                                             will not see stale
6529                                                             global data.
6530
6531     load atomic  acquire      - system       - generic  1. flat_load glc=1
6532                                                         2. s_waitcnt vmcnt(0) &
6533                                                            lgkmcnt(0)
6534
6535                                                           - If TgSplit execution mode,
6536                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
6539                                                           - Must happen before
6540                                                             following
6541                                                             buffer_invl2 and
6542                                                             buffer_wbinvl1_vol.
6543                                                           - Ensures the flat_load
6544                                                             has completed
6545                                                             before invalidating
6546                                                             the caches.
6547
6548                                                         3. buffer_invl2;
6549                                                            buffer_wbinvl1_vol
6550
6551                                                           - Must happen before
6552                                                             any following
6553                                                             global/generic
6554                                                             load/load
6555                                                             atomic/atomicrmw.
6556                                                           - Ensures that
6557                                                             following
6558                                                             loads will not see
6559                                                             stale L1 global data,
6560                                                             nor see stale L2 MTYPE
6561                                                             NC global data.
6562                                                             MTYPE RW and CC memory will
6563                                                             never be stale in L2 due to
6564                                                             the memory probes.
6565
6566     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6567                               - wavefront    - generic
6568     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6569                               - wavefront               local address space cannot
6570                                                         be used.*
6571
6572                                                         1. ds_atomic
6573     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6574                                                         2. s_waitcnt vmcnt(0)
6575
6576                                                           - If not TgSplit execution
6577                                                             mode, omit.
6578                                                           - Must happen before the
6579                                                             following buffer_wbinvl1_vol.
6580                                                           - Ensures the atomicrmw
6581                                                             has completed
6582                                                             before invalidating
6583                                                             the cache.
6584
6585                                                         3. buffer_wbinvl1_vol
6586
6587                                                           - If not TgSplit execution
6588                                                             mode, omit.
6589                                                           - Must happen before
6590                                                             any following
6591                                                             global/generic
6592                                                             load/load
6593                                                             atomic/atomicrmw.
6594                                                           - Ensures that
6595                                                             following loads
6596                                                             will not see stale
6597                                                             global data.
6598
6599     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6600                                                         local address space cannot
6601                                                         be used.*
6602
6603                                                         1. ds_atomic
6604                                                         2. s_waitcnt lgkmcnt(0)
6605
6606                                                           - If OpenCL, omit.
6607                                                           - Must happen before
6608                                                             any following
6609                                                             global/generic
6610                                                             load/load
6611                                                             atomic/store/store
6612                                                             atomic/atomicrmw.
6613                                                           - Ensures any
6614                                                             following global
6615                                                             data read is no
6616                                                             older than the local
6617                                                             atomicrmw value
6618                                                             being acquired.
6619
6620     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6621                                                         2. s_waitcnt lgkm/vmcnt(0)
6622
6623                                                           - Use lgkmcnt(0) if not
6624                                                             TgSplit execution mode
6625                                                             and vmcnt(0) if TgSplit
6626                                                             execution mode.
6627                                                           - If OpenCL, omit lgkmcnt(0).
6628                                                           - Must happen before
6629                                                             the following
6630                                                             buffer_wbinvl1_vol and
6631                                                             any following
6632                                                             global/generic
6633                                                             load/load
6634                                                             atomic/store/store
6635                                                             atomic/atomicrmw.
6636                                                           - Ensures any
6637                                                             following global
6638                                                             data read is no
6639                                                             older than a local
6640                                                             atomicrmw value
6641                                                             being acquired.
6642
6643                                                         3. buffer_wbinvl1_vol
6644
6645                                                           - If not TgSplit execution
6646                                                             mode, omit.
6647                                                           - Ensures that
6648                                                             following
6649                                                             loads will not see
6650                                                             stale data.
6651
6652     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6653                                                         2. s_waitcnt vmcnt(0)
6654
6655                                                           - Must happen before
6656                                                             following
6657                                                             buffer_wbinvl1_vol.
6658                                                           - Ensures the
6659                                                             atomicrmw has
6660                                                             completed before
6661                                                             invalidating the
6662                                                             cache.
6663
6664                                                         3. buffer_wbinvl1_vol
6665
6666                                                           - Must happen before
6667                                                             any following
6668                                                             global/generic
6669                                                             load/load
6670                                                             atomic/atomicrmw.
6671                                                           - Ensures that
6672                                                             following loads
6673                                                             will not see stale
6674                                                             global data.
6675
6676     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
6677                                                         2. s_waitcnt vmcnt(0)
6678
6679                                                           - Must happen before
6680                                                             following buffer_invl2 and
6681                                                             buffer_wbinvl1_vol.
6682                                                           - Ensures the
6683                                                             atomicrmw has
6684                                                             completed before
6685                                                             invalidating the
6686                                                             caches.
6687
6688                                                         3. buffer_invl2;
6689                                                            buffer_wbinvl1_vol
6690
6691                                                           - Must happen before
6692                                                             any following
6693                                                             global/generic
6694                                                             load/load
6695                                                             atomic/atomicrmw.
6696                                                           - Ensures that
6697                                                             following
6698                                                             loads will not see
6699                                                             stale L1 global data,
6700                                                             nor see stale L2 MTYPE
6701                                                             NC global data.
6702                                                             MTYPE RW and CC memory will
6703                                                             never be stale in L2 due to
6704                                                             the memory probes.
6705
6706     atomicrmw    acquire      - agent        - generic  1. flat_atomic
6707                                                         2. s_waitcnt vmcnt(0) &
6708                                                            lgkmcnt(0)
6709
6710                                                           - If TgSplit execution mode,
6711                                                             omit lgkmcnt(0).
6712                                                           - If OpenCL, omit
6713                                                             lgkmcnt(0).
6714                                                           - Must happen before
6715                                                             following
6716                                                             buffer_wbinvl1_vol.
6717                                                           - Ensures the
6718                                                             atomicrmw has
6719                                                             completed before
6720                                                             invalidating the
6721                                                             cache.
6722
6723                                                         3. buffer_wbinvl1_vol
6724
6725                                                           - Must happen before
6726                                                             any following
6727                                                             global/generic
6728                                                             load/load
6729                                                             atomic/atomicrmw.
6730                                                           - Ensures that
6731                                                             following loads
6732                                                             will not see stale
6733                                                             global data.
6734
6735     atomicrmw    acquire      - system       - generic  1. flat_atomic
6736                                                         2. s_waitcnt vmcnt(0) &
6737                                                            lgkmcnt(0)
6738
6739                                                           - If TgSplit execution mode,
6740                                                             omit lgkmcnt(0).
6741                                                           - If OpenCL, omit
6742                                                             lgkmcnt(0).
6743                                                           - Must happen before
6744                                                             following
6745                                                             buffer_invl2 and
6746                                                             buffer_wbinvl1_vol.
6747                                                           - Ensures the
6748                                                             atomicrmw has
6749                                                             completed before
6750                                                             invalidating the
6751                                                             caches.
6752
6753                                                         3. buffer_invl2;
6754                                                            buffer_wbinvl1_vol
6755
6756                                                           - Must happen before
6757                                                             any following
6758                                                             global/generic
6759                                                             load/load
6760                                                             atomic/atomicrmw.
6761                                                           - Ensures that
6762                                                             following
6763                                                             loads will not see
6764                                                             stale L1 global data,
6765                                                             nor see stale L2 MTYPE
6766                                                             NC global data.
6767                                                             MTYPE RW and CC memory will
6768                                                             never be stale in L2 due to
6769                                                             the memory probes.
6770
6771     fence        acquire      - singlethread *none*     *none*
6772                               - wavefront
6773     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
6774
6775                                                           - Use lgkmcnt(0) if not
6776                                                             TgSplit execution mode
6777                                                             and vmcnt(0) if TgSplit
6778                                                             execution mode.
6779                                                           - If OpenCL and
6780                                                             address space is
6781                                                             not generic, omit
6782                                                             lgkmcnt(0).
6783                                                           - If OpenCL and
6784                                                             address space is
6785                                                             local, omit
6786                                                             vmcnt(0).
6787                                                           - However, since LLVM
6788                                                             currently has no
6789                                                             address space on
6790                                                             the fence, the
6791                                                             waits must
6792                                                             conservatively
6793                                                             always be generated.
6794                                                             If the fence had an
6795                                                             address space, it
6796                                                             would be set to the
6797                                                             address space of the
6798                                                             OpenCL fence flag, or
6799                                                             to generic if both
6800                                                             local and global
6801                                                             flags are specified.
6802                                                           - s_waitcnt vmcnt(0)
6803                                                             must happen after
6804                                                             any preceding
6805                                                             global/generic load
6806                                                             atomic/
6807                                                             atomicrmw
6808                                                             with an equal or
6809                                                             wider sync scope
6810                                                             and memory ordering
6811                                                             stronger than
6812                                                             unordered (this is
6813                                                             termed the
6814                                                             fence-paired-atomic).
6815                                                           - s_waitcnt lgkmcnt(0)
6816                                                             must happen after
6817                                                             any preceding
6818                                                             local/generic load
6819                                                             atomic/atomicrmw
6820                                                             with an equal or
6821                                                             wider sync scope
6822                                                             and memory ordering
6823                                                             stronger than
6824                                                             unordered (this is
6825                                                             termed the
6826                                                             fence-paired-atomic).
6827                                                           - Must happen before
6828                                                             the following
6829                                                             buffer_wbinvl1_vol and
6830                                                             any following
6831                                                             global/generic
6832                                                             load/load
6833                                                             atomic/store/store
6834                                                             atomic/atomicrmw.
6835                                                           - Ensures any
6836                                                             following global
6837                                                             data read is no
6838                                                             older than the
6839                                                             value read by the
6840                                                             fence-paired-atomic.
6841
6842                                                         2. buffer_wbinvl1_vol
6843
6844                                                           - If not TgSplit execution
6845                                                             mode, omit.
6846                                                           - Ensures that
6847                                                             following
6848                                                             loads will not see
6849                                                             stale data.
6850
6851     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6852                                                            vmcnt(0)
6853
6854                                                           - If TgSplit execution mode,
6855                                                             omit lgkmcnt(0).
6856                                                           - If OpenCL and
6857                                                             address space is
6858                                                             not generic, omit
6859                                                             lgkmcnt(0).
6860                                                           - However, since LLVM
6861                                                             currently has no
6862                                                             address space on
6863                                                             the fence, the
6864                                                             wait must
6865                                                             conservatively always
6866                                                             be generated (see comment
6867                                                             for previous fence).
6868                                                           - Could be split into
6869                                                             separate s_waitcnt
6870                                                             vmcnt(0) and
6871                                                             s_waitcnt
6872                                                             lgkmcnt(0) to allow
6873                                                             them to be
6874                                                             independently moved
6875                                                             according to the
6876                                                             following rules.
6877                                                           - s_waitcnt vmcnt(0)
6878                                                             must happen after
6879                                                             any preceding
6880                                                             global/generic load
6881                                                             atomic/atomicrmw
6882                                                             with an equal or
6883                                                             wider sync scope
6884                                                             and memory ordering
6885                                                             stronger than
6886                                                             unordered (this is
6887                                                             termed the
6888                                                             fence-paired-atomic).
6889                                                           - s_waitcnt lgkmcnt(0)
6890                                                             must happen after
6891                                                             any preceding
6892                                                             local/generic load
6893                                                             atomic/atomicrmw
6894                                                             with an equal or
6895                                                             wider sync scope
6896                                                             and memory ordering
6897                                                             stronger than
6898                                                             unordered (this is
6899                                                             termed the
6900                                                             fence-paired-atomic).
6901                                                           - Must happen before
6902                                                             the following
6903                                                             buffer_wbinvl1_vol.
6904                                                           - Ensures that the
6905                                                             fence-paired-atomic
6906                                                             has completed
6907                                                             before invalidating
6908                                                             the
6909                                                             cache. Therefore
6910                                                             any following
6911                                                             locations read must
6912                                                             be no older than
6913                                                             the value read by
6914                                                             the
6915                                                             fence-paired-atomic.
6916
6917                                                         2. buffer_wbinvl1_vol
6918
6919                                                           - Must happen before any
6920                                                             following global/generic
6921                                                             load/load
6922                                                             atomic/store/store
6923                                                             atomic/atomicrmw.
6924                                                           - Ensures that
6925                                                             following loads
6926                                                             will not see stale
6927                                                             global data.
6928
6929     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
6930                                                            vmcnt(0)
6931
6932                                                           - If TgSplit execution mode,
6933                                                             omit lgkmcnt(0).
6934                                                           - If OpenCL and
6935                                                             address space is
6936                                                             not generic, omit
6937                                                             lgkmcnt(0).
6938                                                           - However, since LLVM
6939                                                             currently has no
6940                                                             address space on
6941                                                             the fence, the
6942                                                             wait must
6943                                                             conservatively always
6944                                                             be generated (see comment
6945                                                             for previous fence).
6946                                                           - Could be split into
6947                                                             separate s_waitcnt
6948                                                             vmcnt(0) and
6949                                                             s_waitcnt
6950                                                             lgkmcnt(0) to allow
6951                                                             them to be
6952                                                             independently moved
6953                                                             according to the
6954                                                             following rules.
6955                                                           - s_waitcnt vmcnt(0)
6956                                                             must happen after
6957                                                             any preceding
6958                                                             global/generic load
6959                                                             atomic/atomicrmw
6960                                                             with an equal or
6961                                                             wider sync scope
6962                                                             and memory ordering
6963                                                             stronger than
6964                                                             unordered (this is
6965                                                             termed the
6966                                                             fence-paired-atomic).
6967                                                           - s_waitcnt lgkmcnt(0)
6968                                                             must happen after
6969                                                             any preceding
6970                                                             local/generic load
6971                                                             atomic/atomicrmw
6972                                                             with an equal or
6973                                                             wider sync scope
6974                                                             and memory ordering
6975                                                             stronger than
6976                                                             unordered (this is
6977                                                             termed the
6978                                                             fence-paired-atomic).
6979                                                           - Must happen before
6980                                                             the following buffer_invl2 and
6981                                                             buffer_wbinvl1_vol.
6982                                                           - Ensures that the
6983                                                             fence-paired-atomic
6984                                                             has completed
6985                                                             before invalidating
6986                                                             the
6987                                                             caches. Therefore
6988                                                             any following
6989                                                             locations read must
6990                                                             be no older than
6991                                                             the value read by
6992                                                             the
6993                                                             fence-paired-atomic.
6994
6995                                                         2. buffer_invl2;
6996                                                            buffer_wbinvl1_vol
6997
6998                                                           - Must happen before any
6999                                                             following global/generic
7000                                                             load/load
7001                                                             atomic/store/store
7002                                                             atomic/atomicrmw.
7003                                                           - Ensures that
7004                                                             following
7005                                                             loads will not see
7006                                                             stale L1 global data,
7007                                                             nor see stale L2 MTYPE
7008                                                             NC global data.
7009                                                             MTYPE RW and CC memory will
7010                                                             never be stale in L2 due to
7011                                                             the memory probes.
7012     **Release Atomic**
7013     ------------------------------------------------------------------------------------
7014     store atomic release      - singlethread - global   1. buffer/global/flat_store
7015                               - wavefront    - generic
7016     store atomic release      - singlethread - local    *If TgSplit execution mode,
7017                               - wavefront               local address space cannot
7018                                                         be used.*
7019
7020                                                         1. ds_store
7021     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7022                                              - generic
7023                                                           - Use lgkmcnt(0) if not
7024                                                             TgSplit execution mode
7025                                                             and vmcnt(0) if TgSplit
7026                                                             execution mode.
7027                                                           - If OpenCL, omit lgkmcnt(0).
7028                                                           - s_waitcnt vmcnt(0)
7029                                                             must happen after
7030                                                             any preceding
7031                                                             global/generic load/store/
7032                                                             load atomic/store atomic/
7033                                                             atomicrmw.
7034                                                           - s_waitcnt lgkmcnt(0)
7035                                                             must happen after
7036                                                             any preceding
7037                                                             local/generic
7038                                                             load/store/load
7039                                                             atomic/store
7040                                                             atomic/atomicrmw.
7041                                                           - Must happen before
7042                                                             the following
7043                                                             store.
7044                                                           - Ensures that all
7045                                                             memory operations
7046                                                             have
7047                                                             completed before
7048                                                             performing the
7049                                                             store that is being
7050                                                             released.
7051
7052                                                         2. buffer/global/flat_store
7053     store atomic release      - workgroup    - local    *If TgSplit execution mode,
7054                                                         local address space cannot
7055                                                         be used.*
7056
7057                                                         1. ds_store
7058     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7059                                              - generic     vmcnt(0)
7060
7061                                                           - If TgSplit execution mode,
7062                                                             omit lgkmcnt(0).
7063                                                           - If OpenCL and
7064                                                             address space is
7065                                                             not generic, omit
7066                                                             lgkmcnt(0).
7067                                                           - Could be split into
7068                                                             separate s_waitcnt
7069                                                             vmcnt(0) and
7070                                                             s_waitcnt
7071                                                             lgkmcnt(0) to allow
7072                                                             them to be
7073                                                             independently moved
7074                                                             according to the
7075                                                             following rules.
7076                                                           - s_waitcnt vmcnt(0)
7077                                                             must happen after
7078                                                             any preceding
7079                                                             global/generic
7080                                                             load/store/load
7081                                                             atomic/store
7082                                                             atomic/atomicrmw.
7083                                                           - s_waitcnt lgkmcnt(0)
7084                                                             must happen after
7085                                                             any preceding
7086                                                             local/generic
7087                                                             load/store/load
7088                                                             atomic/store
7089                                                             atomic/atomicrmw.
7090                                                           - Must happen before
7091                                                             the following
7092                                                             store.
7093                                                           - Ensures that all
7094                                                             memory operations
7095                                                             to memory have
7096                                                             completed before
7097                                                             performing the
7098                                                             store that is being
7099                                                             released.
7100
7101                                                         2. buffer/global/flat_store
7102     store atomic release      - system       - global   1. buffer_wbl2
7103                                              - generic
7104                                                           - Must happen before
7105                                                             following s_waitcnt.
7106                                                           - Performs L2 writeback to
7107                                                             ensure previous
7108                                                             global/generic
7109                                                             store/atomicrmw are
7110                                                             visible at system scope.
7111
7112                                                         2. s_waitcnt lgkmcnt(0) &
7113                                                            vmcnt(0)
7114
7115                                                           - If TgSplit execution mode,
7116                                                             omit lgkmcnt(0).
7117                                                           - If OpenCL and
7118                                                             address space is
7119                                                             not generic, omit
7120                                                             lgkmcnt(0).
7121                                                           - Could be split into
7122                                                             separate s_waitcnt
7123                                                             vmcnt(0) and
7124                                                             s_waitcnt
7125                                                             lgkmcnt(0) to allow
7126                                                             them to be
7127                                                             independently moved
7128                                                             according to the
7129                                                             following rules.
7130                                                           - s_waitcnt vmcnt(0)
7131                                                             must happen after any
7132                                                             preceding
7133                                                             global/generic
7134                                                             load/store/load
7135                                                             atomic/store
7136                                                             atomic/atomicrmw.
7137                                                           - s_waitcnt lgkmcnt(0)
7138                                                             must happen after any
7139                                                             preceding
7140                                                             local/generic
7141                                                             load/store/load
7142                                                             atomic/store
7143                                                             atomic/atomicrmw.
7144                                                           - Must happen before
7145                                                             the following
7146                                                             store.
7147                                                           - Ensures that all
7148                                                             memory operations
7149                                                             to memory and the L2
7150                                                             writeback have
7151                                                             completed before
7152                                                             performing the
7153                                                             store that is being
7154                                                             released.
7155
7156                                                         3. buffer/global/flat_store
7157     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7158                               - wavefront    - generic
7159     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7160                               - wavefront               local address space cannot
7161                                                         be used.*
7162
7163                                                         1. ds_atomic
7164     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7165                                              - generic
7166                                                           - Use lgkmcnt(0) if not
7167                                                             TgSplit execution mode
7168                                                             and vmcnt(0) if TgSplit
7169                                                             execution mode.
7170                                                           - If OpenCL, omit
7171                                                             lgkmcnt(0).
7172                                                           - s_waitcnt vmcnt(0)
7173                                                             must happen after
7174                                                             any preceding
7175                                                             global/generic load/store/
7176                                                             load atomic/store atomic/
7177                                                             atomicrmw.
7178                                                           - s_waitcnt lgkmcnt(0)
7179                                                             must happen after
7180                                                             any preceding
7181                                                             local/generic
7182                                                             load/store/load
7183                                                             atomic/store
7184                                                             atomic/atomicrmw.
7185                                                           - Must happen before
7186                                                             the following
7187                                                             atomicrmw.
7188                                                           - Ensures that all
7189                                                             memory operations
7190                                                             have
7191                                                             completed before
7192                                                             performing the
7193                                                             atomicrmw that is
7194                                                             being released.
7195
7196                                                         2. buffer/global/flat_atomic
7197     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7198                                                         local address space cannot
7199                                                         be used.*
7200
7201                                                         1. ds_atomic
7202     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7203                                              - generic     vmcnt(0)
7204
7205                                                           - If TgSplit execution mode,
7206                                                             omit lgkmcnt(0).
7207                                                           - If OpenCL, omit
7208                                                             lgkmcnt(0).
7209                                                           - Could be split into
7210                                                             separate s_waitcnt
7211                                                             vmcnt(0) and
7212                                                             s_waitcnt
7213                                                             lgkmcnt(0) to allow
7214                                                             them to be
7215                                                             independently moved
7216                                                             according to the
7217                                                             following rules.
7218                                                           - s_waitcnt vmcnt(0)
7219                                                             must happen after
7220                                                             any preceding
7221                                                             global/generic
7222                                                             load/store/load
7223                                                             atomic/store
7224                                                             atomic/atomicrmw.
7225                                                           - s_waitcnt lgkmcnt(0)
7226                                                             must happen after
7227                                                             any preceding
7228                                                             local/generic
7229                                                             load/store/load
7230                                                             atomic/store
7231                                                             atomic/atomicrmw.
7232                                                           - Must happen before
7233                                                             the following
7234                                                             atomicrmw.
7235                                                           - Ensures that all
7236                                                             memory operations
7237                                                             to global and local
7238                                                             have completed
7239                                                             before performing
7240                                                             the atomicrmw that
7241                                                             is being released.
7242
7243                                                         2. buffer/global/flat_atomic
7244     atomicrmw    release      - system       - global   1. buffer_wbl2
7245                                              - generic
7246                                                           - Must happen before
7247                                                             following s_waitcnt.
7248                                                           - Performs L2 writeback to
7249                                                             ensure previous
7250                                                             global/generic
7251                                                             store/atomicrmw are
7252                                                             visible at system scope.
7253
7254                                                         2. s_waitcnt lgkmcnt(0) &
7255                                                            vmcnt(0)
7256
7257                                                           - If TgSplit execution mode,
7258                                                             omit lgkmcnt(0).
7259                                                           - If OpenCL, omit
7260                                                             lgkmcnt(0).
7261                                                           - Could be split into
7262                                                             separate s_waitcnt
7263                                                             vmcnt(0) and
7264                                                             s_waitcnt
7265                                                             lgkmcnt(0) to allow
7266                                                             them to be
7267                                                             independently moved
7268                                                             according to the
7269                                                             following rules.
7270                                                           - s_waitcnt vmcnt(0)
7271                                                             must happen after
7272                                                             any preceding
7273                                                             global/generic
7274                                                             load/store/load
7275                                                             atomic/store
7276                                                             atomic/atomicrmw.
7277                                                           - s_waitcnt lgkmcnt(0)
7278                                                             must happen after
7279                                                             any preceding
7280                                                             local/generic
7281                                                             load/store/load
7282                                                             atomic/store
7283                                                             atomic/atomicrmw.
7284                                                           - Must happen before
7285                                                             the following
7286                                                             atomicrmw.
7287                                                           - Ensures that all
7288                                                             memory operations
7289                                                             to memory and the L2
7290                                                             writeback have
7291                                                             completed before
7292                                                             performing the
7293                                                             atomicrmw that is being
7294                                                             released.
7295
7296                                                         3. buffer/global/flat_atomic
7297     fence        release      - singlethread *none*     *none*
7298                               - wavefront
7299     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7300
7301                                                           - Use lgkmcnt(0) if not
7302                                                             TgSplit execution mode
7303                                                             and vmcnt(0) if TgSplit
7304                                                             execution mode.
7305                                                           - If OpenCL and
7306                                                             address space is
7307                                                             not generic, omit
7308                                                             lgkmcnt(0).
7309                                                           - If OpenCL and
7310                                                             address space is
7311                                                             local, omit
7312                                                             vmcnt(0).
7313                                                           - However, since LLVM
7314                                                             currently has no
7315                                                             address space on
7316                                                             the fence, the wait
7317                                                             must conservatively
7318                                                             always be generated. If
7319                                                             fence had an
7320                                                             address space then
7321                                                             set to address
7322                                                             space of OpenCL
7323                                                             fence flag, or to
7324                                                             generic if both
7325                                                             local and global
7326                                                             flags are
7327                                                             specified.
7328                                                           - s_waitcnt vmcnt(0)
7329                                                             must happen after
7330                                                             any preceding
7331                                                             global/generic
7332                                                             load/store/
7333                                                             load atomic/store atomic/
7334                                                             atomicrmw.
7335                                                           - s_waitcnt lgkmcnt(0)
7336                                                             must happen after
7337                                                             any preceding
7338                                                             local/generic
7339                                                             load/store/load
7340                                                             atomic/store
7341                                                             atomic/atomicrmw.
7342                                                           - Must happen before
7343                                                             any following store
7344                                                             atomic/atomicrmw
7345                                                             with an equal or
7346                                                             wider sync scope
7347                                                             and memory ordering
7348                                                             stronger than
7349                                                             unordered (this is
7350                                                             termed the
7351                                                             fence-paired-atomic).
7352                                                           - Ensures that all
7353                                                             memory operations
7354                                                             have
7355                                                             completed before
7356                                                             performing the
7357                                                             following
7358                                                             fence-paired-atomic.
7359
7360     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7361                                                            vmcnt(0)
7362
7363                                                           - If TgSplit execution mode,
7364                                                             omit lgkmcnt(0).
7365                                                           - If OpenCL and
7366                                                             address space is
7367                                                             not generic, omit
7368                                                             lgkmcnt(0).
7369                                                           - If OpenCL and
7370                                                             address space is
7371                                                             local, omit
7372                                                             vmcnt(0).
7373                                                           - However, since LLVM
7374                                                             currently has no
7375                                                             address space on
7376                                                             the fence, the wait
7377                                                             must conservatively
7378                                                             always be generated. If
7379                                                             fence had an
7380                                                             address space then
7381                                                             set to address
7382                                                             space of OpenCL
7383                                                             fence flag, or to
7384                                                             generic if both
7385                                                             local and global
7386                                                             flags are
7387                                                             specified.
7388                                                           - Could be split into
7389                                                             separate s_waitcnt
7390                                                             vmcnt(0) and
7391                                                             s_waitcnt
7392                                                             lgkmcnt(0) to allow
7393                                                             them to be
7394                                                             independently moved
7395                                                             according to the
7396                                                             following rules.
7397                                                           - s_waitcnt vmcnt(0)
7398                                                             must happen after
7399                                                             any preceding
7400                                                             global/generic
7401                                                             load/store/load
7402                                                             atomic/store
7403                                                             atomic/atomicrmw.
7404                                                           - s_waitcnt lgkmcnt(0)
7405                                                             must happen after
7406                                                             any preceding
7407                                                             local/generic
7408                                                             load/store/load
7409                                                             atomic/store
7410                                                             atomic/atomicrmw.
7411                                                           - Must happen before
7412                                                             any following store
7413                                                             atomic/atomicrmw
7414                                                             with an equal or
7415                                                             wider sync scope
7416                                                             and memory ordering
7417                                                             stronger than
7418                                                             unordered (this is
7419                                                             termed the
7420                                                             fence-paired-atomic).
7421                                                           - Ensures that all
7422                                                             memory operations
7423                                                             have
7424                                                             completed before
7425                                                             performing the
7426                                                             following
7427                                                             fence-paired-atomic.
7428
7429     fence        release      - system       *none*     1. buffer_wbl2
7430
7431                                                           - If OpenCL and
7432                                                             address space is
7433                                                             local, omit.
7434                                                           - Must happen before
7435                                                             following s_waitcnt.
7436                                                           - Performs L2 writeback to
7437                                                             ensure previous
7438                                                             global/generic
7439                                                             store/atomicrmw are
7440                                                             visible at system scope.
7441
7442                                                         2. s_waitcnt lgkmcnt(0) &
7443                                                            vmcnt(0)
7444
7445                                                           - If TgSplit execution mode,
7446                                                             omit lgkmcnt(0).
7447                                                           - If OpenCL and
7448                                                             address space is
7449                                                             not generic, omit
7450                                                             lgkmcnt(0).
7451                                                           - If OpenCL and
7452                                                             address space is
7453                                                             local, omit
7454                                                             vmcnt(0).
7455                                                           - However, since LLVM
7456                                                             currently has no
7457                                                             address space on
7458                                                             the fence, it must
7459                                                             conservatively
7460                                                             always be generated. If
7461                                                             the fence had an
7462                                                             address space, it would
7463                                                             be set to the address
7464                                                             space of the OpenCL
7465                                                             fence flag, or to
7466                                                             generic if both
7467                                                             local and global
7468                                                             flags are
7469                                                             specified.
7470                                                           - Could be split into
7471                                                             separate s_waitcnt
7472                                                             vmcnt(0) and
7473                                                             s_waitcnt
7474                                                             lgkmcnt(0) to allow
7475                                                             them to be
7476                                                             independently moved
7477                                                             according to the
7478                                                             following rules.
7479                                                           - s_waitcnt vmcnt(0)
7480                                                             must happen after
7481                                                             any preceding
7482                                                             global/generic
7483                                                             load/store/load
7484                                                             atomic/store
7485                                                             atomic/atomicrmw.
7486                                                           - s_waitcnt lgkmcnt(0)
7487                                                             must happen after
7488                                                             any preceding
7489                                                             local/generic
7490                                                             load/store/load
7491                                                             atomic/store
7492                                                             atomic/atomicrmw.
7493                                                           - Must happen before
7494                                                             any following store
7495                                                             atomic/atomicrmw
7496                                                             with an equal or
7497                                                             wider sync scope
7498                                                             and memory ordering
7499                                                             stronger than
7500                                                             unordered (this is
7501                                                             termed the
7502                                                             fence-paired-atomic).
7503                                                           - Ensures that all
7504                                                             memory operations
7505                                                             have
7506                                                             completed before
7507                                                             performing the
7508                                                             following
7509                                                             fence-paired-atomic.
7510
7511     **Acquire-Release Atomic**
7512     ------------------------------------------------------------------------------------
7513     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7514                               - wavefront    - generic
7515     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7516                               - wavefront               local address space cannot
7517                                                         be used.*
7518
7519                                                         1. ds_atomic
7520     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7521
7522                                                           - Use lgkmcnt(0) if not
7523                                                             TgSplit execution mode
7524                                                             and vmcnt(0) if TgSplit
7525                                                             execution mode.
7526                                                           - If OpenCL, omit
7527                                                             lgkmcnt(0).
7528                                                           - Must happen after
7529                                                             any preceding
7530                                                             local/generic
7531                                                             load/store/load
7532                                                             atomic/store
7533                                                             atomic/atomicrmw.
7534                                                           - s_waitcnt vmcnt(0)
7535                                                             must happen after
7536                                                             any preceding
7537                                                             global/generic load/store/
7538                                                             load atomic/store atomic/
7539                                                             atomicrmw.
7540                                                           - s_waitcnt lgkmcnt(0)
7541                                                             must happen after
7542                                                             any preceding
7543                                                             local/generic
7544                                                             load/store/load
7545                                                             atomic/store
7546                                                             atomic/atomicrmw.
7547                                                           - Must happen before
7548                                                             the following
7549                                                             atomicrmw.
7550                                                           - Ensures that all
7551                                                             memory operations
7552                                                             have
7553                                                             completed before
7554                                                             performing the
7555                                                             atomicrmw that is
7556                                                             being released.
7557
7558                                                         2. buffer/global_atomic
7559                                                         3. s_waitcnt vmcnt(0)
7560
7561                                                           - If not TgSplit execution
7562                                                             mode, omit.
7563                                                           - Must happen before
7564                                                             the following
7565                                                             buffer_wbinvl1_vol.
7566                                                           - Ensures any
7567                                                             following global
7568                                                             data read is no
7569                                                             older than the
7570                                                             atomicrmw value
7571                                                             being acquired.
7572
7573                                                         4. buffer_wbinvl1_vol
7574
7575                                                           - If not TgSplit execution
7576                                                             mode, omit.
7577                                                           - Ensures that
7578                                                             following
7579                                                             loads will not see
7580                                                             stale data.
7581
7582     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
7583                                                         local address space cannot
7584                                                         be used.*
7585
7586                                                         1. ds_atomic
7587                                                         2. s_waitcnt lgkmcnt(0)
7588
7589                                                           - If OpenCL, omit.
7590                                                           - Must happen before
7591                                                             any following
7592                                                             global/generic
7593                                                             load/load
7594                                                             atomic/store/store
7595                                                             atomic/atomicrmw.
7596                                                           - Ensures any
7597                                                             following global
7598                                                             data read is no
7599                                                             older than the local
7600                                                             atomicrmw value being
7601                                                             acquired.
7602
7603     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
7604
7605                                                           - Use lgkmcnt(0) if not
7606                                                             TgSplit execution mode
7607                                                             and vmcnt(0) if TgSplit
7608                                                             execution mode.
7609                                                           - If OpenCL, omit
7610                                                             lgkmcnt(0).
7611                                                           - s_waitcnt vmcnt(0)
7612                                                             must happen after
7613                                                             any preceding
7614                                                             global/generic load/store/
7615                                                             load atomic/store atomic/
7616                                                             atomicrmw.
7617                                                           - s_waitcnt lgkmcnt(0)
7618                                                             must happen after
7619                                                             any preceding
7620                                                             local/generic
7621                                                             load/store/load
7622                                                             atomic/store
7623                                                             atomic/atomicrmw.
7624                                                           - Must happen before
7625                                                             the following
7626                                                             atomicrmw.
7627                                                           - Ensures that all
7628                                                             memory operations
7629                                                             have
7630                                                             completed before
7631                                                             performing the
7632                                                             atomicrmw that is
7633                                                             being released.
7634
7635                                                         2. flat_atomic
7636                                                         3. s_waitcnt lgkmcnt(0) &
7637                                                            vmcnt(0)
7638
7639                                                           - If not TgSplit execution
7640                                                             mode, omit vmcnt(0).
7641                                                           - If OpenCL, omit
7642                                                             lgkmcnt(0).
7643                                                           - Must happen before
7644                                                             the following
7645                                                             buffer_wbinvl1_vol and
7646                                                             any following
7647                                                             global/generic
7648                                                             load/load
7649                                                             atomic/store/store
7650                                                             atomic/atomicrmw.
7651                                                           - Ensures any
7652                                                             following global
7653                                                             data read is no
7654                                                             older than a local
7655                                                             atomicrmw value being
7656                                                             acquired.
7657
7658                                                         4. buffer_wbinvl1_vol
7659
7660                                                           - If not TgSplit execution
7661                                                             mode, omit.
7662                                                           - Ensures that
7663                                                             following
7664                                                             loads will not see
7665                                                             stale data.
7666
7667     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7668                                                            vmcnt(0)
7669
7670                                                           - If TgSplit execution mode,
7671                                                             omit lgkmcnt(0).
7672                                                           - If OpenCL, omit
7673                                                             lgkmcnt(0).
7674                                                           - Could be split into
7675                                                             separate s_waitcnt
7676                                                             vmcnt(0) and
7677                                                             s_waitcnt
7678                                                             lgkmcnt(0) to allow
7679                                                             them to be
7680                                                             independently moved
7681                                                             according to the
7682                                                             following rules.
7683                                                           - s_waitcnt vmcnt(0)
7684                                                             must happen after
7685                                                             any preceding
7686                                                             global/generic
7687                                                             load/store/load
7688                                                             atomic/store
7689                                                             atomic/atomicrmw.
7690                                                           - s_waitcnt lgkmcnt(0)
7691                                                             must happen after
7692                                                             any preceding
7693                                                             local/generic
7694                                                             load/store/load
7695                                                             atomic/store
7696                                                             atomic/atomicrmw.
7697                                                           - Must happen before
7698                                                             the following
7699                                                             atomicrmw.
7700                                                           - Ensures that all
7701                                                             memory operations
7702                                                             to global have
7703                                                             completed before
7704                                                             performing the
7705                                                             atomicrmw that is
7706                                                             being released.
7707
7708                                                         2. buffer/global_atomic
7709                                                         3. s_waitcnt vmcnt(0)
7710
7711                                                           - Must happen before
7712                                                             following
7713                                                             buffer_wbinvl1_vol.
7714                                                           - Ensures the
7715                                                             atomicrmw has
7716                                                             completed before
7717                                                             invalidating the
7718                                                             cache.
7719
7720                                                         4. buffer_wbinvl1_vol
7721
7722                                                           - Must happen before
7723                                                             any following
7724                                                             global/generic
7725                                                             load/load
7726                                                             atomic/atomicrmw.
7727                                                           - Ensures that
7728                                                             following loads
7729                                                             will not see stale
7730                                                             global data.
7731
7732     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
7733
7734                                                           - Must happen before
7735                                                             following s_waitcnt.
7736                                                           - Performs L2 writeback to
7737                                                             ensure previous
7738                                                             global/generic
7739                                                             store/atomicrmw are
7740                                                             visible at system scope.
7741
7742                                                         2. s_waitcnt lgkmcnt(0) &
7743                                                            vmcnt(0)
7744
7745                                                           - If TgSplit execution mode,
7746                                                             omit lgkmcnt(0).
7747                                                           - If OpenCL, omit
7748                                                             lgkmcnt(0).
7749                                                           - Could be split into
7750                                                             separate s_waitcnt
7751                                                             vmcnt(0) and
7752                                                             s_waitcnt
7753                                                             lgkmcnt(0) to allow
7754                                                             them to be
7755                                                             independently moved
7756                                                             according to the
7757                                                             following rules.
7758                                                           - s_waitcnt vmcnt(0)
7759                                                             must happen after
7760                                                             any preceding
7761                                                             global/generic
7762                                                             load/store/load
7763                                                             atomic/store
7764                                                             atomic/atomicrmw.
7765                                                           - s_waitcnt lgkmcnt(0)
7766                                                             must happen after
7767                                                             any preceding
7768                                                             local/generic
7769                                                             load/store/load
7770                                                             atomic/store
7771                                                             atomic/atomicrmw.
7772                                                           - Must happen before
7773                                                             the following
7774                                                             atomicrmw.
7775                                                           - Ensures that all
7776                                                             memory operations
7777                                                             to global and L2 writeback
7778                                                             have completed before
7779                                                             performing the
7780                                                             atomicrmw that is
7781                                                             being released.
7782
7783                                                         3. buffer/global_atomic
7784                                                         4. s_waitcnt vmcnt(0)
7785
7786                                                           - Must happen before
7787                                                             following buffer_invl2 and
7788                                                             buffer_wbinvl1_vol.
7789                                                           - Ensures the
7790                                                             atomicrmw has
7791                                                             completed before
7792                                                             invalidating the
7793                                                             caches.
7794
7795                                                         5. buffer_invl2;
7796                                                            buffer_wbinvl1_vol
7797
7798                                                           - Must happen before
7799                                                             any following
7800                                                             global/generic
7801                                                             load/load
7802                                                             atomic/atomicrmw.
7803                                                           - Ensures that
7804                                                             following
7805                                                             loads will not see
7806                                                             stale L1 global data,
7807                                                             nor see stale L2 MTYPE
7808                                                             NC global data.
7809                                                             MTYPE RW and CC memory will
7810                                                             never be stale in L2 due to
7811                                                             the memory probes.
7812
7813     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
7814                                                            vmcnt(0)
7815
7816                                                           - If TgSplit execution mode,
7817                                                             omit lgkmcnt(0).
7818                                                           - If OpenCL, omit
7819                                                             lgkmcnt(0).
7820                                                           - Could be split into
7821                                                             separate s_waitcnt
7822                                                             vmcnt(0) and
7823                                                             s_waitcnt
7824                                                             lgkmcnt(0) to allow
7825                                                             them to be
7826                                                             independently moved
7827                                                             according to the
7828                                                             following rules.
7829                                                           - s_waitcnt vmcnt(0)
7830                                                             must happen after
7831                                                             any preceding
7832                                                             global/generic
7833                                                             load/store/load
7834                                                             atomic/store
7835                                                             atomic/atomicrmw.
7836                                                           - s_waitcnt lgkmcnt(0)
7837                                                             must happen after
7838                                                             any preceding
7839                                                             local/generic
7840                                                             load/store/load
7841                                                             atomic/store
7842                                                             atomic/atomicrmw.
7843                                                           - Must happen before
7844                                                             the following
7845                                                             atomicrmw.
7846                                                           - Ensures that all
7847                                                             memory operations
7848                                                             to global have
7849                                                             completed before
7850                                                             performing the
7851                                                             atomicrmw that is
7852                                                             being released.
7853
7854                                                         2. flat_atomic
7855                                                         3. s_waitcnt vmcnt(0) &
7856                                                            lgkmcnt(0)
7857
7858                                                           - If TgSplit execution mode,
7859                                                             omit lgkmcnt(0).
7860                                                           - If OpenCL, omit
7861                                                             lgkmcnt(0).
7862                                                           - Must happen before
7863                                                             following
7864                                                             buffer_wbinvl1_vol.
7865                                                           - Ensures the
7866                                                             atomicrmw has
7867                                                             completed before
7868                                                             invalidating the
7869                                                             cache.
7870
7871                                                         4. buffer_wbinvl1_vol
7872
7873                                                           - Must happen before
7874                                                             any following
7875                                                             global/generic
7876                                                             load/load
7877                                                             atomic/atomicrmw.
7878                                                           - Ensures that
7879                                                             following loads
7880                                                             will not see stale
7881                                                             global data.
7882
7883     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
7884
7885                                                           - Must happen before
7886                                                             following s_waitcnt.
7887                                                           - Performs L2 writeback to
7888                                                             ensure previous
7889                                                             global/generic
7890                                                             store/atomicrmw are
7891                                                             visible at system scope.
7892
7893                                                         2. s_waitcnt lgkmcnt(0) &
7894                                                            vmcnt(0)
7895
7896                                                           - If TgSplit execution mode,
7897                                                             omit lgkmcnt(0).
7898                                                           - If OpenCL, omit
7899                                                             lgkmcnt(0).
7900                                                           - Could be split into
7901                                                             separate s_waitcnt
7902                                                             vmcnt(0) and
7903                                                             s_waitcnt
7904                                                             lgkmcnt(0) to allow
7905                                                             them to be
7906                                                             independently moved
7907                                                             according to the
7908                                                             following rules.
7909                                                           - s_waitcnt vmcnt(0)
7910                                                             must happen after
7911                                                             any preceding
7912                                                             global/generic
7913                                                             load/store/load
7914                                                             atomic/store
7915                                                             atomic/atomicrmw.
7916                                                           - s_waitcnt lgkmcnt(0)
7917                                                             must happen after
7918                                                             any preceding
7919                                                             local/generic
7920                                                             load/store/load
7921                                                             atomic/store
7922                                                             atomic/atomicrmw.
7923                                                           - Must happen before
7924                                                             the following
7925                                                             atomicrmw.
7926                                                           - Ensures that all
7927                                                             memory operations
7928                                                             to global and L2 writeback
7929                                                             have completed before
7930                                                             performing the
7931                                                             atomicrmw that is
7932                                                             being released.
7933
7934                                                         3. flat_atomic
7935                                                         4. s_waitcnt vmcnt(0) &
7936                                                            lgkmcnt(0)
7937
7938                                                           - If TgSplit execution mode,
7939                                                             omit lgkmcnt(0).
7940                                                           - If OpenCL, omit
7941                                                             lgkmcnt(0).
7942                                                           - Must happen before
7943                                                             following buffer_invl2 and
7944                                                             buffer_wbinvl1_vol.
7945                                                           - Ensures the
7946                                                             atomicrmw has
7947                                                             completed before
7948                                                             invalidating the
7949                                                             caches.
7950
7951                                                         5. buffer_invl2;
7952                                                            buffer_wbinvl1_vol
7953
7954                                                           - Must happen before
7955                                                             any following
7956                                                             global/generic
7957                                                             load/load
7958                                                             atomic/atomicrmw.
7959                                                           - Ensures that
7960                                                             following
7961                                                             loads will not see
7962                                                             stale L1 global data,
7963                                                             nor see stale L2 MTYPE
7964                                                             NC global data.
7965                                                             MTYPE RW and CC memory will
7966                                                             never be stale in L2 due to
7967                                                             the memory probes.
7968
7969     fence        acq_rel      - singlethread *none*     *none*
7970                               - wavefront
7971     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7972
7973                                                           - Use lgkmcnt(0) if not
7974                                                             TgSplit execution mode
7975                                                             and vmcnt(0) if TgSplit
7976                                                             execution mode.
7977                                                           - If OpenCL and
7978                                                             address space is
7979                                                             not generic, omit
7980                                                             lgkmcnt(0).
7981                                                           - If OpenCL and
7982                                                             address space is
7983                                                             local, omit
7984                                                             vmcnt(0).
7985                                                           - However,
7986                                                             since LLVM
7987                                                             currently has no
7988                                                             address space on
7989                                                             the fence, the wait
7990                                                             must conservatively
7991                                                             always be generated
7992                                                             (see comment for
7993                                                             previous fence).
7994                                                           - s_waitcnt vmcnt(0)
7995                                                             must happen after
7996                                                             any preceding
7997                                                             global/generic
7998                                                             load/store/
7999                                                             load atomic/store atomic/
8000                                                             atomicrmw.
8001                                                           - s_waitcnt lgkmcnt(0)
8002                                                             must happen after
8003                                                             any preceding
8004                                                             local/generic
8005                                                             load/load
8006                                                             atomic/store/store
8007                                                             atomic/atomicrmw.
8008                                                           - Must happen before
8009                                                             any following
8010                                                             global/generic
8011                                                             load/load
8012                                                             atomic/store/store
8013                                                             atomic/atomicrmw.
8014                                                           - Ensures that all
8015                                                             memory operations
8016                                                             have
8017                                                             completed before
8018                                                             performing any
8019                                                             following global
8020                                                             memory operations.
8021                                                           - Ensures that the
8022                                                             preceding
8023                                                             local/generic load
8024                                                             atomic/atomicrmw
8025                                                             with an equal or
8026                                                             wider sync scope
8027                                                             and memory ordering
8028                                                             stronger than
8029                                                             unordered (this is
8030                                                             termed the
8031                                                             acquire-fence-paired-atomic)
8032                                                             has completed
8033                                                             before following
8034                                                             global memory
8035                                                             operations. This
8036                                                             satisfies the
8037                                                             requirements of
8038                                                             acquire.
8039                                                           - Ensures that all
8040                                                             previous memory
8041                                                             operations have
8042                                                             completed before a
8043                                                             following
8044                                                             local/generic store
8045                                                             atomic/atomicrmw
8046                                                             with an equal or
8047                                                             wider sync scope
8048                                                             and memory ordering
8049                                                             stronger than
8050                                                             unordered (this is
8051                                                             termed the
8052                                                             release-fence-paired-atomic).
8053                                                             This satisfies the
8054                                                             requirements of
8055                                                             release.
8056                                                           - Must happen before
8057                                                             the following
8058                                                             buffer_wbinvl1_vol.
8059                                                           - Ensures that the
8060                                                             acquire-fence-paired-atomic
8061                                                             has completed
8062                                                             before invalidating
8063                                                             the
8064                                                             cache. Therefore
8065                                                             any following
8066                                                             locations read must
8067                                                             be no older than
8068                                                             the value read by
8069                                                             the
8070                                                             acquire-fence-paired-atomic.
8071
8072                                                         2. buffer_wbinvl1_vol
8073
8074                                                           - If not TgSplit execution
8075                                                             mode, omit.
8076                                                           - Ensures that
8077                                                             following
8078                                                             loads will not see
8079                                                             stale data.
8080
8081     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8082                                                            vmcnt(0)
8083
8084                                                           - If TgSplit execution mode,
8085                                                             omit lgkmcnt(0).
8086                                                           - If OpenCL and
8087                                                             address space is
8088                                                             not generic, omit
8089                                                             lgkmcnt(0).
8090                                                           - However, since LLVM
8091                                                             currently has no
8092                                                             address space on
8093                                                             the fence, the wait
8094                                                             must conservatively
8095                                                             always be generated
8096                                                             (see comment for
8097                                                             previous fence).
8098                                                           - Could be split into
8099                                                             separate s_waitcnt
8100                                                             vmcnt(0) and
8101                                                             s_waitcnt
8102                                                             lgkmcnt(0) to allow
8103                                                             them to be
8104                                                             independently moved
8105                                                             according to the
8106                                                             following rules.
8107                                                           - s_waitcnt vmcnt(0)
8108                                                             must happen after
8109                                                             any preceding
8110                                                             global/generic
8111                                                             load/store/load
8112                                                             atomic/store
8113                                                             atomic/atomicrmw.
8114                                                           - s_waitcnt lgkmcnt(0)
8115                                                             must happen after
8116                                                             any preceding
8117                                                             local/generic
8118                                                             load/store/load
8119                                                             atomic/store
8120                                                             atomic/atomicrmw.
8121                                                           - Must happen before
8122                                                             the following
8123                                                             buffer_wbinvl1_vol.
8124                                                           - Ensures that the
8125                                                             preceding
8126                                                             global/local/generic
8127                                                             load
8128                                                             atomic/atomicrmw
8129                                                             with an equal or
8130                                                             wider sync scope
8131                                                             and memory ordering
8132                                                             stronger than
8133                                                             unordered (this is
8134                                                             termed the
8135                                                             acquire-fence-paired-atomic)
8136                                                             has completed
8137                                                             before invalidating
8138                                                             the cache. This
8139                                                             satisfies the
8140                                                             requirements of
8141                                                             acquire.
8142                                                           - Ensures that all
8143                                                             previous memory
8144                                                             operations have
8145                                                             completed before a
8146                                                             following
8147                                                             global/local/generic
8148                                                             store
8149                                                             atomic/atomicrmw
8150                                                             with an equal or
8151                                                             wider sync scope
8152                                                             and memory ordering
8153                                                             stronger than
8154                                                             unordered (this is
8155                                                             termed the
8156                                                             release-fence-paired-atomic).
8157                                                             This satisfies the
8158                                                             requirements of
8159                                                             release.
8160
8161                                                         2. buffer_wbinvl1_vol
8162
8163                                                           - Must happen before
8164                                                             any following
8165                                                             global/generic
8166                                                             load/load
8167                                                             atomic/store/store
8168                                                             atomic/atomicrmw.
8169                                                           - Ensures that
8170                                                             following loads
8171                                                             will not see stale
8172                                                             global data. This
8173                                                             satisfies the
8174                                                             requirements of
8175                                                             acquire.
8176
8177     fence        acq_rel      - system       *none*     1. buffer_wbl2
8178
8179                                                           - If OpenCL and
8180                                                             address space is
8181                                                             local, omit.
8182                                                           - Must happen before
8183                                                             following s_waitcnt.
8184                                                           - Performs L2 writeback to
8185                                                             ensure previous
8186                                                             global/generic
8187                                                             store/atomicrmw are
8188                                                             visible at system scope.
8189
8190                                                         2. s_waitcnt lgkmcnt(0) &
8191                                                            vmcnt(0)
8192
8193                                                           - If TgSplit execution mode,
8194                                                             omit lgkmcnt(0).
8195                                                           - If OpenCL and
8196                                                             address space is
8197                                                             not generic, omit
8198                                                             lgkmcnt(0).
8199                                                           - However, since LLVM
8200                                                             currently has no
8201                                                             address space on
8202                                                             the fence, the wait
8203                                                             must conservatively
8204                                                             always be generated
8205                                                             (see comment for
8206                                                             previous fence).
8207                                                           - Could be split into
8208                                                             separate s_waitcnt
8209                                                             vmcnt(0) and
8210                                                             s_waitcnt
8211                                                             lgkmcnt(0) to allow
8212                                                             them to be
8213                                                             independently moved
8214                                                             according to the
8215                                                             following rules.
8216                                                           - s_waitcnt vmcnt(0)
8217                                                             must happen after
8218                                                             any preceding
8219                                                             global/generic
8220                                                             load/store/load
8221                                                             atomic/store
8222                                                             atomic/atomicrmw.
8223                                                           - s_waitcnt lgkmcnt(0)
8224                                                             must happen after
8225                                                             any preceding
8226                                                             local/generic
8227                                                             load/store/load
8228                                                             atomic/store
8229                                                             atomic/atomicrmw.
8230                                                           - Must happen before
8231                                                             the following buffer_invl2 and
8232                                                             buffer_wbinvl1_vol.
8233                                                           - Ensures that the
8234                                                             preceding
8235                                                             global/local/generic
8236                                                             load
8237                                                             atomic/atomicrmw
8238                                                             with an equal or
8239                                                             wider sync scope
8240                                                             and memory ordering
8241                                                             stronger than
8242                                                             unordered (this is
8243                                                             termed the
8244                                                             acquire-fence-paired-atomic)
8245                                                             has completed
8246                                                             before invalidating
8247                                                             the cache. This
8248                                                             satisfies the
8249                                                             requirements of
8250                                                             acquire.
8251                                                           - Ensures that all
8252                                                             previous memory
8253                                                             operations have
8254                                                             completed before a
8255                                                             following
8256                                                             global/local/generic
8257                                                             store
8258                                                             atomic/atomicrmw
8259                                                             with an equal or
8260                                                             wider sync scope
8261                                                             and memory ordering
8262                                                             stronger than
8263                                                             unordered (this is
8264                                                             termed the
8265                                                             release-fence-paired-atomic).
8266                                                             This satisfies the
8267                                                             requirements of
8268                                                             release.
8269
8270                                                         3. buffer_invl2;
8271                                                            buffer_wbinvl1_vol
8272
8273                                                           - Must happen before
8274                                                             any following
8275                                                             global/generic
8276                                                             load/load
8277                                                             atomic/store/store
8278                                                             atomic/atomicrmw.
8279                                                           - Ensures that
8280                                                             following
8281                                                             loads will not see
8282                                                             stale L1 global data,
8283                                                             nor see stale L2 MTYPE
8284                                                             NC global data.
8285                                                             MTYPE RW and CC memory will
8286                                                             never be stale in L2 due to
8287                                                             the memory probes.
8288
8289     **Sequentially Consistent Atomic**
8290     ------------------------------------------------------------------------------------
8291     load atomic  seq_cst      - singlethread - global   *Same as corresponding
8292                               - wavefront    - local    load atomic acquire,
8293                                              - generic  except must generate
8294                                                         all instructions even
8295                                                         for OpenCL.*
8296     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8297                                              - generic
8298                                                           - Use lgkmcnt(0) if not
8299                                                             TgSplit execution mode
8300                                                             and vmcnt(0) if TgSplit
8301                                                             execution mode.
8302                                                           - s_waitcnt lgkmcnt(0) must
8303                                                             happen after
8304                                                             preceding
8305                                                             local/generic load
8306                                                             atomic/store
8307                                                             atomic/atomicrmw
8308                                                             with memory
8309                                                             ordering of seq_cst
8310                                                             and with equal or
8311                                                             wider sync scope.
8312                                                             (Note that seq_cst
8313                                                             fences have their
8314                                                             own s_waitcnt
8315                                                             lgkmcnt(0) and so do
8316                                                             not need to be
8317                                                             considered.)
8318                                                           - s_waitcnt vmcnt(0)
8319                                                             must happen after
8320                                                             preceding
8321                                                             global/generic load
8322                                                             atomic/store
8323                                                             atomic/atomicrmw
8324                                                             with memory
8325                                                             ordering of seq_cst
8326                                                             and with equal or
8327                                                             wider sync scope.
8328                                                             (Note that seq_cst
8329                                                             fences have their
8330                                                             own s_waitcnt
8331                                                             vmcnt(0) and so do
8332                                                             not need to be
8333                                                             considered.)
8334                                                           - Ensures any
8335                                                             preceding
                                                             sequentially
8337                                                             consistent global/local
8338                                                             memory instructions
8339                                                             have completed
8340                                                             before executing
8341                                                             this sequentially
8342                                                             consistent
8343                                                             instruction. This
8344                                                             prevents reordering
8345                                                             a seq_cst store
8346                                                             followed by a
8347                                                             seq_cst load. (Note
8348                                                             that seq_cst is
8349                                                             stronger than
8350                                                             acquire/release as
8351                                                             the reordering of
8352                                                             load acquire
8353                                                             followed by a store
8354                                                             release is
8355                                                             prevented by the
8356                                                             s_waitcnt of
8357                                                             the release, but
8358                                                             there is nothing
8359                                                             preventing a store
8360                                                             release followed by
8361                                                             load acquire from
8362                                                             completing out of
8363                                                             order. The s_waitcnt
8364                                                             could be placed after
8365                                                             seq_store or before
8366                                                             the seq_load. We
8367                                                             choose the load to
8368                                                             make the s_waitcnt be
8369                                                             as late as possible
8370                                                             so that the store
8371                                                             may have already
8372                                                             completed.)
8373
8374                                                         2. *Following
8375                                                            instructions same as
8376                                                            corresponding load
8377                                                            atomic acquire,
8378                                                            except must generate
8379                                                            all instructions even
8380                                                            for OpenCL.*
8381     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8382                                                         local address space cannot
8383                                                         be used.*
8384
8385                                                         *Same as corresponding
8386                                                         load atomic acquire,
8387                                                         except must generate
8388                                                         all instructions even
8389                                                         for OpenCL.*
8390
8391     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8392                               - system       - generic     vmcnt(0)
8393
8394                                                           - If TgSplit execution mode,
8395                                                             omit lgkmcnt(0).
8396                                                           - Could be split into
8397                                                             separate s_waitcnt
8398                                                             vmcnt(0)
8399                                                             and s_waitcnt
8400                                                             lgkmcnt(0) to allow
8401                                                             them to be
8402                                                             independently moved
8403                                                             according to the
8404                                                             following rules.
8405                                                           - s_waitcnt lgkmcnt(0)
8406                                                             must happen after
8407                                                             preceding
8408                                                             global/generic load
8409                                                             atomic/store
8410                                                             atomic/atomicrmw
8411                                                             with memory
8412                                                             ordering of seq_cst
8413                                                             and with equal or
8414                                                             wider sync scope.
8415                                                             (Note that seq_cst
8416                                                             fences have their
8417                                                             own s_waitcnt
8418                                                             lgkmcnt(0) and so do
8419                                                             not need to be
8420                                                             considered.)
8421                                                           - s_waitcnt vmcnt(0)
8422                                                             must happen after
8423                                                             preceding
8424                                                             global/generic load
8425                                                             atomic/store
8426                                                             atomic/atomicrmw
8427                                                             with memory
8428                                                             ordering of seq_cst
8429                                                             and with equal or
8430                                                             wider sync scope.
8431                                                             (Note that seq_cst
8432                                                             fences have their
8433                                                             own s_waitcnt
8434                                                             vmcnt(0) and so do
8435                                                             not need to be
8436                                                             considered.)
8437                                                           - Ensures any
8438                                                             preceding
                                                             sequentially
8440                                                             consistent global
8441                                                             memory instructions
8442                                                             have completed
8443                                                             before executing
8444                                                             this sequentially
8445                                                             consistent
8446                                                             instruction. This
8447                                                             prevents reordering
8448                                                             a seq_cst store
8449                                                             followed by a
8450                                                             seq_cst load. (Note
8451                                                             that seq_cst is
8452                                                             stronger than
8453                                                             acquire/release as
8454                                                             the reordering of
8455                                                             load acquire
8456                                                             followed by a store
8457                                                             release is
8458                                                             prevented by the
8459                                                             s_waitcnt of
8460                                                             the release, but
8461                                                             there is nothing
8462                                                             preventing a store
8463                                                             release followed by
8464                                                             load acquire from
8465                                                             completing out of
8466                                                             order. The s_waitcnt
8467                                                             could be placed after
8468                                                             seq_store or before
8469                                                             the seq_load. We
8470                                                             choose the load to
8471                                                             make the s_waitcnt be
8472                                                             as late as possible
8473                                                             so that the store
8474                                                             may have already
8475                                                             completed.)
8476
8477                                                         2. *Following
8478                                                            instructions same as
8479                                                            corresponding load
8480                                                            atomic acquire,
8481                                                            except must generate
8482                                                            all instructions even
8483                                                            for OpenCL.*
8484     store atomic seq_cst      - singlethread - global   *Same as corresponding
8485                               - wavefront    - local    store atomic release,
8486                               - workgroup    - generic  except must generate
8487                               - agent                   all instructions even
8488                               - system                  for OpenCL.*
8489     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8490                               - wavefront    - local    atomicrmw acq_rel,
8491                               - workgroup    - generic  except must generate
8492                               - agent                   all instructions even
8493                               - system                  for OpenCL.*
8494     fence        seq_cst      - singlethread *none*     *Same as corresponding
8495                               - wavefront               fence acq_rel,
8496                               - workgroup               except must generate
8497                               - agent                   all instructions even
8498                               - system                  for OpenCL.*
8499     ============ ============ ============== ========== ================================
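
As a reading aid, the ``s_waitcnt`` that leads the ``load atomic seq_cst`` rows
for the global and generic address spaces above can be modeled as a small
function. This is a minimal sketch rather than the backend's implementation:
the name ``seq_cst_load_waitcnt`` and its parameters are hypothetical, and it
covers only the leading wait, not the acquire sequence that follows it.

.. code-block:: python

  def seq_cst_load_waitcnt(scope, tgsplit):
      """Counters waited on (to 0) before a seq_cst global/generic load atomic.

      Hypothetical helper that mirrors the table rows; not an LLVM API.
      """
      if scope in ("singlethread", "wavefront"):
          # Same as load atomic acquire: no extra leading s_waitcnt.
          return []
      if scope == "workgroup":
          # lgkmcnt(0) if not TgSplit execution mode, vmcnt(0) if TgSplit.
          return ["vmcnt"] if tgsplit else ["lgkmcnt"]
      if scope in ("agent", "system"):
          # lgkmcnt(0) & vmcnt(0); TgSplit execution mode omits lgkmcnt(0).
          return ["vmcnt"] if tgsplit else ["lgkmcnt", "vmcnt"]
      raise ValueError(f"unknown sync scope: {scope}")

For instance, ``seq_cst_load_waitcnt("agent", tgsplit=True)`` yields only
``vmcnt``, because the table says to omit ``lgkmcnt(0)`` in TgSplit execution
mode.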
8500
8501.. _amdgpu-amdhsa-memory-model-gfx10:
8502
8503Memory Model GFX10
8504++++++++++++++++++
8505
8506For GFX10:
8507
8508* Each agent has multiple shader arrays (SA).
8509* Each SA has multiple work-group processors (WGP).
8510* Each WGP has multiple compute units (CU).
8511* Each CU has multiple SIMDs that execute wavefronts.
8512* The wavefronts for a single work-group are executed in the same
8513  WGP. In CU wavefront execution mode the wavefronts may be executed by
8514  different SIMDs in the same CU. In WGP wavefront execution mode the
8515  wavefronts may be executed by different SIMDs in different CUs in the same
8516  WGP.
8517* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
8518  executing on it.
8519* All LDS operations of a WGP are performed as wavefront wide operations in a
8520  global order and involve no caching. Completion is reported to a wavefront in
8521  execution order.
8522* The LDS memory has multiple request queues shared by the SIMDs of a
8523  WGP. Therefore, the LDS operations performed by different wavefronts of a
8524  work-group can be reordered relative to each other, which can result in
8525  reordering the visibility of vector memory operations with respect to LDS
8526  operations of other wavefronts in the same work-group. A ``s_waitcnt
8527  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8528  vector memory operations between wavefronts of a work-group, but not between
8529  operations performed by the same wavefront.
8530* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
8532  execution order of other load/store/sample operations performed by that
8533  wavefront.
8534* The vector memory operations access a vector L0 cache. There is a single L0
8535  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
8536  special action is required for coherence between the lanes of a single
8537  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
8538  wavefronts executing in the same work-group as they may be executing on SIMDs
8539  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
8540  required for coherence between wavefronts executing in different work-groups
8541  as they may be executing on different WGPs.
8542* The scalar memory operations access a scalar L0 cache shared by all wavefronts
8543  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
8544  operations are used in a restricted way so do not impact the memory model. See
8545  :ref:`amdgpu-amdhsa-memory-spaces`.
8546* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
8547  the same SA. Therefore, no special action is required for coherence between
8548  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
8549  required for coherence between wavefronts executing in different work-groups
8550  as they may be executing on different SAs that access different L1s.
8551* The L1 caches have independent quadrants to service disjoint ranges of virtual
8552  addresses.
8553* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
8554  vector and scalar memory operations performed by different wavefronts, whether
8555  executing in the same or different work-groups (which may be executing on
8556  different CUs accessing different L0s), can be reordered relative to each
8557  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
8558  synchronization between vector memory operations of different wavefronts. It
8559  ensures a previous vector memory operation has completed before executing a
8560  subsequent vector memory or LDS operation and so can be used to meet the
8561  requirements of acquire, release and sequential consistency.
8562* The L1 caches use an L2 cache shared by all SAs on the same agent.
8563* The L2 cache has independent channels to service disjoint ranges of virtual
8564  addresses.
8565* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
8566  quadrant has a separate request queue per L2 channel. Therefore, the vector
8567  and scalar memory operations performed by wavefronts executing in different
8568  work-groups (which may be executing on different SAs) of an agent can be
8569  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
8570  required to ensure synchronization between vector memory operations of
8571  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
8574* The L2 cache can be kept coherent with other agents on some targets, or ranges
8575  of virtual addresses can be set up to bypass it to ensure system coherence.
8576* On GFX10.3 a memory attached last level (MALL) cache exists for GPU memory.
8577  The MALL cache is fully coherent with GPU memory and has no impact on system
8578  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
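
The cache hierarchy described by the list above can be illustrated with a small
model that reports the nearest vector memory cache level shared by two
wavefronts of the same agent. This is a sketch under stated assumptions: the
``Location`` tuple and the ``closest_shared_cache`` function are hypothetical
names introduced only for illustration, and the ``cu`` field is assumed to
number a CU within its WGP.

.. code-block:: python

  from collections import namedtuple

  # Hypothetical placement of a wavefront within one agent:
  # shader array, work-group processor within the SA, compute unit within the WGP.
  Location = namedtuple("Location", ["sa", "wgp", "cu"])

  def closest_shared_cache(a, b):
      """Return the nearest cache level shared by two wavefronts of one agent."""
      if a.sa != b.sa:
          # L1 caches are per SA; only the L2 is shared by all SAs.
          return "L2"
      if (a.wgp, a.cu) != (b.wgp, b.cu):
          # Different CUs have different L0s but share the L1 of their SA.
          return "L1"
      # Every SIMD of a CU accesses the same L0.
      return "L0"

Two wavefronts whose closest shared level is ``L1`` or ``L2`` are the cases in
which the list above calls for ``buffer_gl0_inv`` (and, across SAs,
``buffer_gl1_inv``) to obtain coherence.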
8579
8580Scalar memory operations are only used to access memory that is proven to not
8581change during the execution of the kernel dispatch. This includes constant
8582address space and global address space for program scope ``const`` variables.
8583Therefore, the kernel machine code does not have to maintain the scalar cache to
8584ensure it is coherent with the vector caches. The scalar and vector caches are
8585invalidated between kernel dispatches by CP since constant address space data
8586may change between kernel dispatch executions. See
8587:ref:`amdgpu-amdhsa-memory-spaces`.
8588
8589The one exception is if scalar writes are used to spill SGPR registers. In this
8590case the AMDGPU backend ensures the memory location used to spill is never
8591accessed by vector memory operations at the same time. If scalar writes are used
8592then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8593return since the locations may be used for vector memory instructions by a
8594future wavefront that uses the same scratch area, or a function call that
8595creates a frame at the same address, respectively. There is no need for a
8596``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8597
8598For kernarg backing memory:
8599
8600* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
8601* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
8602  needing to invalidate the L2 cache.
8603* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8604  so the L2 cache will be coherent with the CPU and other agents.
8605
8606Scratch backing memory (which is used for the private address space) is accessed
8607with MTYPE NC (non-coherent). Since the private address space is only accessed
8608by a single thread, and is always write-before-read, there is never a need to
8609invalidate these entries from the L0 or L1 caches.
8610
8611Wavefronts are executed in native mode with in-order reporting of loads and
8612sample instructions. In this mode vmcnt reports completion of load, atomic with
8613return and sample instructions in order, and the vscnt reports the completion of
8614store and atomic without return in order. See ``MEM_ORDERED`` field in
8615:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
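
The split between the two counters can be summarized as a lookup. The following
is an illustrative sketch only: ``completion_counter`` and its arguments are
hypothetical names, not part of the backend.

.. code-block:: python

  def completion_counter(kind, returns_value=False):
      """Return the counter that tracks completion of a vector memory operation.

      ``kind`` is one of "load", "sample", "store" or "atomic" (hypothetical
      labels); ``returns_value`` only matters for atomics.
      """
      if kind in ("load", "sample"):
          return "vmcnt"
      if kind == "store":
          return "vscnt"
      if kind == "atomic":
          # Atomic with return counts like a load, without return like a store.
          return "vmcnt" if returns_value else "vscnt"
      raise ValueError(f"unknown operation kind: {kind}")

This is why the sequences for loads wait on ``vmcnt(0)`` while those for stores
wait on ``vscnt(0)``.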
8616
8617Wavefronts can be executed in WGP or CU wavefront execution mode:
8618
8619* In WGP wavefront execution mode the wavefronts of a work-group are executed
8620  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also, accesses to L1
8622  at work-group scope need to be explicitly ordered as the accesses from
8623  different CUs are not ordered.
8624* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group use the same L0, which in turn ensures L1 accesses are
  ordered, so explicit management of the caches is not required for
  work-group synchronization.
8629
8630See ``WGP_MODE`` field in
8631:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
8632:ref:`amdgpu-target-features`.
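
The effect of the wavefront execution mode on cache management can be sketched
as a function from sync scope to the cache invalidates an acquire needs. This
is a hedged illustration derived from the cache description above, not the
backend's code: ``acquire_cache_invalidates`` is a hypothetical name, and the
code sequences table remains the authoritative reference.

.. code-block:: python

  def acquire_cache_invalidates(scope, wgp_mode):
      """Cache invalidates needed so later loads do not see stale data.

      ``wgp_mode`` is True for WGP wavefront execution mode and False for CU
      wavefront execution mode. Hypothetical helper for illustration only.
      """
      if scope in ("singlethread", "wavefront"):
          # A single wavefront always uses the L0 of its own CU.
          return []
      if scope == "workgroup":
          # In WGP mode the work-group may span both CUs (two L0s);
          # in CU mode all of its wavefronts use the same L0.
          return ["buffer_gl0_inv"] if wgp_mode else []
      if scope in ("agent", "system"):
          # Other work-groups may run on other WGPs (L0) and other SAs (L1).
          return ["buffer_gl0_inv", "buffer_gl1_inv"]
      raise ValueError(f"unknown sync scope: {scope}")

In this model the execution mode matters only at work-group scope, which
matches the "If CU wavefront execution mode, omit" notes on the workgroup rows
of the table.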
8633
8634The code sequences used to implement the memory model for GFX10 are defined in
8635table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.
8636
8637  .. table:: AMDHSA Memory Model Code Sequences GFX10
8638     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table
8639
8640     ============ ============ ============== ========== ================================
8641     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
8642                  Ordering     Sync Scope     Address    GFX10
8643                                              Space
8644     ============ ============ ============== ========== ================================
8645     **Non-Atomic**
8646     ------------------------------------------------------------------------------------
8647     load         *none*       *none*         - global   - !volatile & !nontemporal
8648                                              - generic
8649                                              - private    1. buffer/global/flat_load
8650                                              - constant
8651                                                         - !volatile & nontemporal
8652
8653                                                           1. buffer/global/flat_load
8654                                                              slc=1
8655
8656                                                         - volatile
8657
8658                                                           1. buffer/global/flat_load
8659                                                              glc=1 dlc=1
8660                                                           2. s_waitcnt vmcnt(0)
8661
8662                                                            - Must happen before
8663                                                              any following volatile
8664                                                              global/generic
8665                                                              load/store.
8666                                                            - Ensures that
8667                                                              volatile
8668                                                              operations to
8669                                                              different
8670                                                              addresses will not
8671                                                              be reordered by
8672                                                              hardware.
8673
8674     load         *none*       *none*         - local    1. ds_load
8675     store        *none*       *none*         - global   - !volatile & !nontemporal
8676                                              - generic
8677                                              - private    1. buffer/global/flat_store
8678                                              - constant
8679                                                         - !volatile & nontemporal
8680
8681                                                           1. buffer/global/flat_store
8682                                                              glc=1 slc=1
8683
8684                                                         - volatile
8685
8686                                                           1. buffer/global/flat_store
8687                                                           2. s_waitcnt vscnt(0)
8688
8689                                                            - Must happen before
8690                                                              any following volatile
8691                                                              global/generic
8692                                                              load/store.
8693                                                            - Ensures that
8694                                                              volatile
8695                                                              operations to
8696                                                              different
8697                                                              addresses will not
8698                                                              be reordered by
8699                                                              hardware.
8700
8701     store        *none*       *none*         - local    1. ds_store
8702     **Unordered Atomic**
8703     ------------------------------------------------------------------------------------
8704     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
8705     store atomic unordered    *any*          *any*      *Same as non-atomic*.
8706     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
8707     **Monotonic Atomic**
8708     ------------------------------------------------------------------------------------
8709     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
8710                               - wavefront    - generic
8711     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
8712                                              - generic     glc=1
8713
8714                                                           - If CU wavefront execution
8715                                                             mode, omit glc=1.
8716
8717     load atomic  monotonic    - singlethread - local    1. ds_load
8718                               - wavefront
8719                               - workgroup
8720     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
8721                               - system       - generic     glc=1 dlc=1
8722     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
8723                               - wavefront    - generic
8724                               - workgroup
8725                               - agent
8726                               - system
8727     store atomic monotonic    - singlethread - local    1. ds_store
8728                               - wavefront
8729                               - workgroup
8730     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
8731                               - wavefront    - generic
8732                               - workgroup
8733                               - agent
8734                               - system
8735     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
8736                               - wavefront
8737                               - workgroup
8738     **Acquire Atomic**
8739     ------------------------------------------------------------------------------------
8740     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
8741                               - wavefront    - local
8742                                              - generic
8743     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
8744
8745                                                           - If CU wavefront execution
8746                                                             mode, omit glc=1.
8747
8748                                                         2. s_waitcnt vmcnt(0)
8749
8750                                                           - If CU wavefront execution
8751                                                             mode, omit.
8752                                                           - Must happen before
8753                                                             the following buffer_gl0_inv
8754                                                             and before any following
8755                                                             global/generic
8756                                                             load/load
8757                                                             atomic/store/store
8758                                                             atomic/atomicrmw.
8759
8760                                                         3. buffer_gl0_inv
8761
8762                                                           - If CU wavefront execution
8763                                                             mode, omit.
8764                                                           - Ensures that
8765                                                             following
8766                                                             loads will not see
8767                                                             stale data.
8768
8769     load atomic  acquire      - workgroup    - local    1. ds_load
8770                                                         2. s_waitcnt lgkmcnt(0)
8771
8772                                                           - If OpenCL, omit.
8773                                                           - Must happen before
8774                                                             the following buffer_gl0_inv
8775                                                             and before any following
8776                                                             global/generic load/load
8777                                                             atomic/store/store
8778                                                             atomic/atomicrmw.
8779                                                           - Ensures any
8780                                                             following global
8781                                                             data read is no
8782                                                             older than the local load
8783                                                             atomic value being
8784                                                             acquired.
8785
8786                                                         3. buffer_gl0_inv
8787
8788                                                           - If CU wavefront execution
8789                                                             mode, omit.
8790                                                           - If OpenCL, omit.
8791                                                           - Ensures that
8792                                                             following
8793                                                             loads will not see
8794                                                             stale data.
8795
8796     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
8797
8798                                                           - If CU wavefront execution
8799                                                             mode, omit glc=1.
8800
8801                                                         2. s_waitcnt lgkmcnt(0) &
8802                                                            vmcnt(0)
8803
8804                                                           - If CU wavefront execution
8805                                                             mode, omit vmcnt(0).
8806                                                           - If OpenCL, omit
8807                                                             lgkmcnt(0).
8808                                                           - Must happen before
8809                                                             the following
8810                                                             buffer_gl0_inv and any
8811                                                             following global/generic
8812                                                             load/load
8813                                                             atomic/store/store
8814                                                             atomic/atomicrmw.
8815                                                           - Ensures any
8816                                                             following global
8817                                                             data read is no
8818                                                             older than a local load
8819                                                             atomic value being
8820                                                             acquired.
8821
8822                                                         3. buffer_gl0_inv
8823
8824                                                           - If CU wavefront execution
8825                                                             mode, omit.
8826                                                           - Ensures that
8827                                                             following
8828                                                             loads will not see
8829                                                             stale data.
8830
8831     load atomic  acquire      - agent        - global   1. buffer/global_load
8832                               - system                     glc=1 dlc=1
8833                                                         2. s_waitcnt vmcnt(0)
8834
8835                                                           - Must happen before
8836                                                             following
8837                                                             buffer_gl*_inv.
8838                                                           - Ensures the load
8839                                                             has completed
8840                                                             before invalidating
8841                                                             the caches.
8842
8843                                                         3. buffer_gl0_inv;
8844                                                            buffer_gl1_inv
8845
8846                                                           - Must happen before
8847                                                             any following
8848                                                             global/generic
8849                                                             load/load
8850                                                             atomic/atomicrmw.
8851                                                           - Ensures that
8852                                                             following
8853                                                             loads will not see
8854                                                             stale global data.
8855
8856     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
8857                               - system                  2. s_waitcnt vmcnt(0) &
8858                                                            lgkmcnt(0)
8859
8860                                                           - If OpenCL, omit
8861                                                             lgkmcnt(0).
8862                                                           - Must happen before
8863                                                             following
8864                                                             buffer_gl*_inv.
8865                                                           - Ensures the flat_load
8866                                                             has completed
8867                                                             before invalidating
8868                                                             the caches.
8869
8870                                                         3. buffer_gl0_inv;
8871                                                            buffer_gl1_inv
8872
8873                                                           - Must happen before
8874                                                             any following
8875                                                             global/generic
8876                                                             load/load
8877                                                             atomic/atomicrmw.
8878                                                           - Ensures that
8879                                                             following loads
8880                                                             will not see stale
8881                                                             global data.
8882
8883     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
8884                               - wavefront    - local
8885                                              - generic
8886     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
8887                                                         2. s_waitcnt vm/vscnt(0)
8888
8889                                                           - If CU wavefront execution
8890                                                             mode, omit.
8891                                                           - Use vmcnt(0) if atomic with
8892                                                             return and vscnt(0) if
8893                                                             atomic with no-return.
8894                                                           - Must happen before
8895                                                             the following buffer_gl0_inv
8896                                                             and before any following
8897                                                             global/generic
8898                                                             load/load
8899                                                             atomic/store/store
8900                                                             atomic/atomicrmw.
8901
8902                                                         3. buffer_gl0_inv
8903
8904                                                           - If CU wavefront execution
8905                                                             mode, omit.
8906                                                           - Ensures that
8907                                                             following
8908                                                             loads will not see
8909                                                             stale data.
8910
8911     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
8912                                                         2. s_waitcnt lgkmcnt(0)
8913
8914                                                           - If OpenCL, omit.
8915                                                           - Must happen before
8916                                                             the following
8917                                                             buffer_gl0_inv.
8918                                                           - Ensures any
8919                                                             following global
8920                                                             data read is no
8921                                                             older than the local
8922                                                             atomicrmw value
8923                                                             being acquired.
8924
8925                                                         3. buffer_gl0_inv
8926
8927                                                           - If OpenCL, omit.
8928                                                           - Ensures that
8929                                                             following
8930                                                             loads will not see
8931                                                             stale data.
8932
8933     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
8934                                                         2. s_waitcnt lgkmcnt(0) &
8935                                                            vm/vscnt(0)
8936
8937                                                           - If CU wavefront execution
8938                                                             mode, omit vm/vscnt(0).
8939                                                           - If OpenCL, omit lgkmcnt(0).
8940                                                           - Use vmcnt(0) if atomic with
8941                                                             return and vscnt(0) if
8942                                                             atomic with no-return.
8943                                                           - Must happen before
8944                                                             the following
8945                                                             buffer_gl0_inv.
8946                                                           - Ensures any
8947                                                             following global
8948                                                             data read is no
8949                                                             older than a local
8950                                                             atomicrmw value
8951                                                             being acquired.
8952
8953                                                         3. buffer_gl0_inv
8954
8955                                                           - If CU wavefront execution
8956                                                             mode, omit.
8957                                                           - Ensures that
8958                                                             following
8959                                                             loads will not see
8960                                                             stale data.
8961
8962     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
8963                               - system                  2. s_waitcnt vm/vscnt(0)
8964
8965                                                           - Use vmcnt(0) if atomic with
8966                                                             return and vscnt(0) if
8967                                                             atomic with no-return.
8968                                                           - Must happen before
8969                                                             following
8970                                                             buffer_gl*_inv.
8971                                                           - Ensures the
8972                                                             atomicrmw has
8973                                                             completed before
8974                                                             invalidating the
8975                                                             caches.
8976
8977                                                         3. buffer_gl0_inv;
8978                                                            buffer_gl1_inv
8979
8980                                                           - Must happen before
8981                                                             any following
8982                                                             global/generic
8983                                                             load/load
8984                                                             atomic/atomicrmw.
8985                                                           - Ensures that
8986                                                             following loads
8987                                                             will not see stale
8988                                                             global data.
8989
8990     atomicrmw    acquire      - agent        - generic  1. flat_atomic
8991                               - system                  2. s_waitcnt vm/vscnt(0) &
8992                                                            lgkmcnt(0)
8993
8994                                                           - If OpenCL, omit
8995                                                             lgkmcnt(0).
8996                                                           - Use vmcnt(0) if atomic with
8997                                                             return and vscnt(0) if
8998                                                             atomic with no-return.
8999                                                           - Must happen before
9000                                                             following
9001                                                             buffer_gl*_inv.
9002                                                           - Ensures the
9003                                                             atomicrmw has
9004                                                             completed before
9005                                                             invalidating the
9006                                                             caches.
9007
9008                                                         3. buffer_gl0_inv;
9009                                                            buffer_gl1_inv
9010
9011                                                           - Must happen before
9012                                                             any following
9013                                                             global/generic
9014                                                             load/load
9015                                                             atomic/atomicrmw.
9016                                                           - Ensures that
9017                                                             following loads
9018                                                             will not see stale
9019                                                             global data.
9020
9021     fence        acquire      - singlethread *none*     *none*
9022                               - wavefront
9023     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
9024                                                            vmcnt(0) & vscnt(0)
9025
9026                                                           - If CU wavefront execution
9027                                                             mode, omit vmcnt(0) and
9028                                                             vscnt(0).
9029                                                           - If OpenCL and
9030                                                             address space is
9031                                                             not generic, omit
9032                                                             lgkmcnt(0).
9033                                                           - If OpenCL and
9034                                                             address space is
9035                                                             local, omit
9036                                                             vmcnt(0) and vscnt(0).
9037                                                           - However, since LLVM
9038                                                             currently has no
9039                                                             address space on
9040                                                             the fence, these waits
9041                                                             must conservatively
9042                                                             always be generated. If
9043                                                             the fence had an
9044                                                             address space, it would
9045                                                             be set to the address
9046                                                             space of the OpenCL
9047                                                             fence flag, or to
9048                                                             generic if both
9049                                                             local and global
9050                                                             flags are
9051                                                             specified.
9052                                                           - Could be split into
9053                                                             separate s_waitcnt
9054                                                             vmcnt(0), s_waitcnt
9055                                                             vscnt(0) and s_waitcnt
9056                                                             lgkmcnt(0) to allow
9057                                                             them to be
9058                                                             independently moved
9059                                                             according to the
9060                                                             following rules.
9061                                                           - s_waitcnt vmcnt(0)
9062                                                             must happen after
9063                                                             any preceding
9064                                                             global/generic load
9065                                                             atomic/
9066                                                             atomicrmw-with-return-value
9067                                                             with an equal or
9068                                                             wider sync scope
9069                                                             and memory ordering
9070                                                             stronger than
9071                                                             unordered (this is
9072                                                             termed the
9073                                                             fence-paired-atomic).
9074                                                           - s_waitcnt vscnt(0)
9075                                                             must happen after
9076                                                             any preceding
9077                                                             global/generic
9078                                                             atomicrmw-no-return-value
9079                                                             with an equal or
9080                                                             wider sync scope
9081                                                             and memory ordering
9082                                                             stronger than
9083                                                             unordered (this is
9084                                                             termed the
9085                                                             fence-paired-atomic).
9086                                                           - s_waitcnt lgkmcnt(0)
9087                                                             must happen after
9088                                                             any preceding
9089                                                             local/generic load
9090                                                             atomic/atomicrmw
9091                                                             with an equal or
9092                                                             wider sync scope
9093                                                             and memory ordering
9094                                                             stronger than
9095                                                             unordered (this is
9096                                                             termed the
9097                                                             fence-paired-atomic).
9098                                                           - Must happen before
9099                                                             the following
9100                                                             buffer_gl0_inv.
9101                                                           - Ensures that the
9102                                                             fence-paired-atomic
9103                                                             has completed
9104                                                             before invalidating
9105                                                             the
9106                                                             cache. Therefore
9107                                                             any following
9108                                                             locations read must
9109                                                             be no older than
9110                                                             the value read by
9111                                                             the
9112                                                             fence-paired-atomic.
9113
9114                                                         2. buffer_gl0_inv
9115
9116                                                           - If CU wavefront execution
9117                                                             mode, omit.
9118                                                           - Ensures that
9119                                                             following
9120                                                             loads will not see
9121                                                             stale data.
9122
9123     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9124                               - system                     vmcnt(0) & vscnt(0)
9125
9126                                                           - If OpenCL and
9127                                                             address space is
9128                                                             not generic, omit
9129                                                             lgkmcnt(0).
9130                                                           - If OpenCL and
9131                                                             address space is
9132                                                             local, omit
9133                                                             vmcnt(0) and vscnt(0).
9134                                                           - However, since LLVM
9135                                                             currently has no
9136                                                             address space on
9137                                                             the fence, these waits
9138                                                             must conservatively
9139                                                             always be generated
9140                                                             (see comment for
9141                                                             previous fence).
9142                                                           - Could be split into
9143                                                             separate s_waitcnt
9144                                                             vmcnt(0), s_waitcnt
9145                                                             vscnt(0) and s_waitcnt
9146                                                             lgkmcnt(0) to allow
9147                                                             them to be
9148                                                             independently moved
9149                                                             according to the
9150                                                             following rules.
9151                                                           - s_waitcnt vmcnt(0)
9152                                                             must happen after
9153                                                             any preceding
9154                                                             global/generic load
9155                                                             atomic/
9156                                                             atomicrmw-with-return-value
9157                                                             with an equal or
9158                                                             wider sync scope
9159                                                             and memory ordering
9160                                                             stronger than
9161                                                             unordered (this is
9162                                                             termed the
9163                                                             fence-paired-atomic).
9164                                                           - s_waitcnt vscnt(0)
9165                                                             must happen after
9166                                                             any preceding
9167                                                             global/generic
9168                                                             atomicrmw-no-return-value
9169                                                             with an equal or
9170                                                             wider sync scope
9171                                                             and memory ordering
9172                                                             stronger than
9173                                                             unordered (this is
9174                                                             termed the
9175                                                             fence-paired-atomic).
9176                                                           - s_waitcnt lgkmcnt(0)
9177                                                             must happen after
9178                                                             any preceding
9179                                                             local/generic load
9180                                                             atomic/atomicrmw
9181                                                             with an equal or
9182                                                             wider sync scope
9183                                                             and memory ordering
9184                                                             stronger than
9185                                                             unordered (this is
9186                                                             termed the
9187                                                             fence-paired-atomic).
9188                                                           - Must happen before
9189                                                             the following
9190                                                             buffer_gl*_inv.
9191                                                           - Ensures that the
9192                                                             fence-paired-atomic
9193                                                             has completed
9194                                                             before invalidating
9195                                                             the
9196                                                             caches. Therefore
9197                                                             any following
9198                                                             locations read must
9199                                                             be no older than
9200                                                             the value read by
9201                                                             the
9202                                                             fence-paired-atomic.
9203
9204                                                         2. buffer_gl0_inv;
9205                                                            buffer_gl1_inv
9206
9207                                                           - Must happen before any
9208                                                             following global/generic
9209                                                             load/load
9210                                                             atomic/store/store
9211                                                             atomic/atomicrmw.
9212                                                           - Ensures that
9213                                                             following loads
9214                                                             will not see stale
9215                                                             global data.
9216
9217     **Release Atomic**
9218     ------------------------------------------------------------------------------------
9219     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
9220                               - wavefront    - local
9221                                              - generic
9222     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9223                                              - generic     vmcnt(0) & vscnt(0)
9224
9225                                                           - If CU wavefront execution
9226                                                             mode, omit vmcnt(0) and
9227                                                             vscnt(0).
9228                                                           - If OpenCL, omit
9229                                                             lgkmcnt(0).
9230                                                           - Could be split into
9231                                                             separate s_waitcnt
9232                                                             vmcnt(0), s_waitcnt
9233                                                             vscnt(0) and s_waitcnt
9234                                                             lgkmcnt(0) to allow
9235                                                             them to be
9236                                                             independently moved
9237                                                             according to the
9238                                                             following rules.
9239                                                           - s_waitcnt vmcnt(0)
9240                                                             must happen after
9241                                                             any preceding
9242                                                             global/generic load/load
9243                                                             atomic/
9244                                                             atomicrmw-with-return-value.
9245                                                           - s_waitcnt vscnt(0)
9246                                                             must happen after
9247                                                             any preceding
9248                                                             global/generic
9249                                                             store/store
9250                                                             atomic/
9251                                                             atomicrmw-no-return-value.
9252                                                           - s_waitcnt lgkmcnt(0)
9253                                                             must happen after
9254                                                             any preceding
9255                                                             local/generic
9256                                                             load/store/load
9257                                                             atomic/store
9258                                                             atomic/atomicrmw.
9259                                                           - Must happen before
9260                                                             the following
9261                                                             store.
9262                                                           - Ensures that all
9263                                                             memory operations
9264                                                             have
9265                                                             completed before
9266                                                             performing the
9267                                                             store that is being
9268                                                             released.
9269
9270                                                         2. buffer/global/flat_store
9271     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9272
9273                                                           - If CU wavefront execution
9274                                                             mode, omit.
9275                                                           - If OpenCL, omit.
9276                                                           - Could be split into
9277                                                             separate s_waitcnt
9278                                                             vmcnt(0) and s_waitcnt
9279                                                             vscnt(0) to allow
9280                                                             them to be
9281                                                             independently moved
9282                                                             according to the
9283                                                             following rules.
9284                                                           - s_waitcnt vmcnt(0)
9285                                                             must happen after
9286                                                             any preceding
9287                                                             global/generic load/load
9288                                                             atomic/
9289                                                             atomicrmw-with-return-value.
9290                                                           - s_waitcnt vscnt(0)
9291                                                             must happen after
9292                                                             any preceding
9293                                                             global/generic
9294                                                             store/store atomic/
9295                                                             atomicrmw-no-return-value.
9296                                                           - Must happen before
9297                                                             the following
9298                                                             store.
9299                                                           - Ensures that all
9300                                                             global memory
9301                                                             operations have
9302                                                             completed before
9303                                                             performing the
9304                                                             store that is being
9305                                                             released.
9306
9307                                                         2. ds_store
9308     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9309                               - system       - generic     vmcnt(0) & vscnt(0)
9310
9311                                                           - If OpenCL and
9312                                                             address space is
9313                                                             not generic, omit
9314                                                             lgkmcnt(0).
9315                                                           - Could be split into
9316                                                             separate s_waitcnt
9317                                                             vmcnt(0), s_waitcnt vscnt(0)
9318                                                             and s_waitcnt
9319                                                             lgkmcnt(0) to allow
9320                                                             them to be
9321                                                             independently moved
9322                                                             according to the
9323                                                             following rules.
9324                                                           - s_waitcnt vmcnt(0)
9325                                                             must happen after
9326                                                             any preceding
9327                                                             global/generic
9328                                                             load/load
9329                                                             atomic/
9330                                                             atomicrmw-with-return-value.
9331                                                           - s_waitcnt vscnt(0)
9332                                                             must happen after
9333                                                             any preceding
9334                                                             global/generic
9335                                                             store/store atomic/
9336                                                             atomicrmw-no-return-value.
9337                                                           - s_waitcnt lgkmcnt(0)
9338                                                             must happen after
9339                                                             any preceding
9340                                                             local/generic
9341                                                             load/store/load
9342                                                             atomic/store
9343                                                             atomic/atomicrmw.
9344                                                           - Must happen before
9345                                                             the following
9346                                                             store.
9347                                                           - Ensures that all
9348                                                             memory operations
9349                                                             have
9350                                                             completed before
9351                                                             performing the
9352                                                             store that is being
9353                                                             released.
9354
9355                                                         2. buffer/global/flat_store
9356     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
9357                               - wavefront    - local
9358                                              - generic
9359     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9360                                              - generic     vmcnt(0) & vscnt(0)
9361
9362                                                           - If CU wavefront execution
9363                                                             mode, omit vmcnt(0) and
9364                                                             vscnt(0).
9365                                                           - If OpenCL, omit lgkmcnt(0).
9366                                                           - Could be split into
9367                                                             separate s_waitcnt
9368                                                             vmcnt(0), s_waitcnt
9369                                                             vscnt(0) and s_waitcnt
9370                                                             lgkmcnt(0) to allow
9371                                                             them to be
9372                                                             independently moved
9373                                                             according to the
9374                                                             following rules.
9375                                                           - s_waitcnt vmcnt(0)
9376                                                             must happen after
9377                                                             any preceding
9378                                                             global/generic load/load
9379                                                             atomic/
9380                                                             atomicrmw-with-return-value.
9381                                                           - s_waitcnt vscnt(0)
9382                                                             must happen after
9383                                                             any preceding
9384                                                             global/generic
9385                                                             store/store
9386                                                             atomic/
9387                                                             atomicrmw-no-return-value.
9388                                                           - s_waitcnt lgkmcnt(0)
9389                                                             must happen after
9390                                                             any preceding
9391                                                             local/generic
9392                                                             load/store/load
9393                                                             atomic/store
9394                                                             atomic/atomicrmw.
9395                                                           - Must happen before
9396                                                             the following
9397                                                             atomicrmw.
9398                                                           - Ensures that all
9399                                                             memory operations
9400                                                             have
9401                                                             completed before
9402                                                             performing the
9403                                                             atomicrmw that is
9404                                                             being released.
9405
9406                                                         2. buffer/global/flat_atomic
9407     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9408
9409                                                           - If CU wavefront execution
9410                                                             mode, omit.
9411                                                           - If OpenCL, omit.
9412                                                           - Could be split into
9413                                                             separate s_waitcnt
9414                                                             vmcnt(0) and s_waitcnt
9415                                                             vscnt(0) to allow
9416                                                             them to be
9417                                                             independently moved
9418                                                             according to the
9419                                                             following rules.
9420                                                           - s_waitcnt vmcnt(0)
9421                                                             must happen after
9422                                                             any preceding
9423                                                             global/generic load/load
9424                                                             atomic/
9425                                                             atomicrmw-with-return-value.
9426                                                           - s_waitcnt vscnt(0)
9427                                                             must happen after
9428                                                             any preceding
9429                                                             global/generic
9430                                                             store/store atomic/
9431                                                             atomicrmw-no-return-value.
9432                                                           - Must happen before
9433                                                             the following
9434                                                             atomicrmw.
9435                                                           - Ensures that all
9436                                                             global memory
9437                                                             operations have
9438                                                             completed before
9439                                                             performing the
9440                                                             atomicrmw that is
9441                                                             being released.
9442
9443                                                         2. ds_atomic
9444     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9445                               - system       - generic     vmcnt(0) & vscnt(0)
9446
9447                                                           - If OpenCL, omit
9448                                                             lgkmcnt(0).
9449                                                           - Could be split into
9450                                                             separate s_waitcnt
9451                                                             vmcnt(0), s_waitcnt
9452                                                             vscnt(0) and s_waitcnt
9453                                                             lgkmcnt(0) to allow
9454                                                             them to be
9455                                                             independently moved
9456                                                             according to the
9457                                                             following rules.
9458                                                           - s_waitcnt vmcnt(0)
9459                                                             must happen after
9460                                                             any preceding
9461                                                             global/generic
9462                                                             load/load atomic/
9463                                                             atomicrmw-with-return-value.
9464                                                           - s_waitcnt vscnt(0)
9465                                                             must happen after
9466                                                             any preceding
9467                                                             global/generic
9468                                                             store/store atomic/
9469                                                             atomicrmw-no-return-value.
9470                                                           - s_waitcnt lgkmcnt(0)
9471                                                             must happen after
9472                                                             any preceding
9473                                                             local/generic
9474                                                             load/store/load
9475                                                             atomic/store
9476                                                             atomic/atomicrmw.
9477                                                           - Must happen before
9478                                                             the following
9479                                                             atomicrmw.
9480                                                           - Ensures that all
9481                                                             memory operations
9482                                                             to global and local
9483                                                             have completed
9484                                                             before performing
9485                                                             the atomicrmw that
9486                                                             is being released.
9487
9488                                                         2. buffer/global/flat_atomic
9489     fence        release      - singlethread *none*     *none*
9490                               - wavefront
9491     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
9492                                                            vmcnt(0) & vscnt(0)
9493
9494                                                           - If CU wavefront execution
9495                                                             mode, omit vmcnt(0) and
9496                                                             vscnt(0).
9497                                                           - If OpenCL and
9498                                                             address space is
9499                                                             not generic, omit
9500                                                             lgkmcnt(0).
9501                                                           - If OpenCL and
9502                                                             address space is
9503                                                             local, omit
9504                                                             vmcnt(0) and vscnt(0).
9505                                                           - However, since LLVM
9506                                                             currently has no
9507                                                             address space on
9508                                                             the fence, the waits
9509                                                             must conservatively
9510                                                             always be generated.
9511                                                             If the fence had an
9512                                                             address space, it
9513                                                             would be set to the
9514                                                             address space of the
9515                                                             OpenCL fence flag, or
9516                                                             to generic if both
9517                                                             local and global
9518                                                             flags are
9519                                                             specified.
9520                                                           - Could be split into
9521                                                             separate s_waitcnt
9522                                                             vmcnt(0), s_waitcnt
9523                                                             vscnt(0) and s_waitcnt
9524                                                             lgkmcnt(0) to allow
9525                                                             them to be
9526                                                             independently moved
9527                                                             according to the
9528                                                             following rules.
9529                                                           - s_waitcnt vmcnt(0)
9530                                                             must happen after
9531                                                             any preceding
9532                                                             global/generic
9533                                                             load/load
9534                                                             atomic/
9535                                                             atomicrmw-with-return-value.
9536                                                           - s_waitcnt vscnt(0)
9537                                                             must happen after
9538                                                             any preceding
9539                                                             global/generic
9540                                                             store/store atomic/
9541                                                             atomicrmw-no-return-value.
9542                                                           - s_waitcnt lgkmcnt(0)
9543                                                             must happen after
9544                                                             any preceding
9545                                                             local/generic
9546                                                             load/store/load
9547                                                             atomic/store atomic/
9548                                                             atomicrmw.
9549                                                           - Must happen before
9550                                                             any following store
9551                                                             atomic/atomicrmw
9552                                                             with an equal or
9553                                                             wider sync scope
9554                                                             and memory ordering
9555                                                             stronger than
9556                                                             unordered (this is
9557                                                             termed the
9558                                                             fence-paired-atomic).
9559                                                           - Ensures that all
9560                                                             memory operations
9561                                                             have
9562                                                             completed before
9563                                                             performing the
9564                                                             following
9565                                                             fence-paired-atomic.
9566
9567     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9568                               - system                     vmcnt(0) & vscnt(0)
9569
9570                                                           - If OpenCL and
9571                                                             address space is
9572                                                             not generic, omit
9573                                                             lgkmcnt(0).
9574                                                           - If OpenCL and
9575                                                             address space is
9576                                                             local, omit
9577                                                             vmcnt(0) and vscnt(0).
9578                                                           - However, since LLVM
9579                                                             currently has no
9580                                                             address space on
9581                                                             the fence, the waits
9582                                                             must conservatively
9583                                                             always be generated.
9584                                                             If the fence had an
9585                                                             address space, it
9586                                                             would be set to the
9587                                                             address space of the
9588                                                             OpenCL fence flag, or
9589                                                             to generic if both
9590                                                             local and global
9591                                                             flags are
9592                                                             specified.
9593                                                           - Could be split into
9594                                                             separate s_waitcnt
9595                                                             vmcnt(0), s_waitcnt
9596                                                             vscnt(0) and s_waitcnt
9597                                                             lgkmcnt(0) to allow
9598                                                             them to be
9599                                                             independently moved
9600                                                             according to the
9601                                                             following rules.
9602                                                           - s_waitcnt vmcnt(0)
9603                                                             must happen after
9604                                                             any preceding
9605                                                             global/generic
9606                                                             load/load atomic/
9607                                                             atomicrmw-with-return-value.
9608                                                           - s_waitcnt vscnt(0)
9609                                                             must happen after
9610                                                             any preceding
9611                                                             global/generic
9612                                                             store/store atomic/
9613                                                             atomicrmw-no-return-value.
9614                                                           - s_waitcnt lgkmcnt(0)
9615                                                             must happen after
9616                                                             any preceding
9617                                                             local/generic
9618                                                             load/store/load
9619                                                             atomic/store
9620                                                             atomic/atomicrmw.
9621                                                           - Must happen before
9622                                                             any following store
9623                                                             atomic/atomicrmw
9624                                                             with an equal or
9625                                                             wider sync scope
9626                                                             and memory ordering
9627                                                             stronger than
9628                                                             unordered (this is
9629                                                             termed the
9630                                                             fence-paired-atomic).
9631                                                           - Ensures that all
9632                                                             memory operations
9633                                                             have
9634                                                             completed before
9635                                                             performing the
9636                                                             following
9637                                                             fence-paired-atomic.
9638
9639     **Acquire-Release Atomic**
9640     ------------------------------------------------------------------------------------
9641     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
9642                               - wavefront    - local
9643                                              - generic
9644     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9645                                                            vmcnt(0) & vscnt(0)
9646
9647                                                           - If CU wavefront execution
9648                                                             mode, omit vmcnt(0) and
9649                                                             vscnt(0).
9650                                                           - If OpenCL, omit
9651                                                             lgkmcnt(0).
9652                                                           - Must happen after
9653                                                             any preceding
9654                                                             local/generic
9655                                                             load/store/load
9656                                                             atomic/store
9657                                                             atomic/atomicrmw.
9658                                                           - Could be split into
9659                                                             separate s_waitcnt
9660                                                             vmcnt(0), s_waitcnt
9661                                                             vscnt(0), and s_waitcnt
9662                                                             lgkmcnt(0) to allow
9663                                                             them to be
9664                                                             independently moved
9665                                                             according to the
9666                                                             following rules.
9667                                                           - s_waitcnt vmcnt(0)
9668                                                             must happen after
9669                                                             any preceding
9670                                                             global/generic load/load
9671                                                             atomic/
9672                                                             atomicrmw-with-return-value.
9673                                                           - s_waitcnt vscnt(0)
9674                                                             must happen after
9675                                                             any preceding
9676                                                             global/generic
9677                                                             store/store
9678                                                             atomic/
9679                                                             atomicrmw-no-return-value.
9680                                                           - s_waitcnt lgkmcnt(0)
9681                                                             must happen after
9682                                                             any preceding
9683                                                             local/generic
9684                                                             load/store/load
9685                                                             atomic/store
9686                                                             atomic/atomicrmw.
9687                                                           - Must happen before
9688                                                             the following
9689                                                             atomicrmw.
9690                                                           - Ensures that all
9691                                                             memory operations
9692                                                             have
9693                                                             completed before
9694                                                             performing the
9695                                                             atomicrmw that is
9696                                                             being released.
9697
9698                                                         2. buffer/global_atomic
9699                                                         3. s_waitcnt vm/vscnt(0)
9700
9701                                                           - If CU wavefront execution
9702                                                             mode, omit.
9703                                                           - Use vmcnt(0) if atomic with
9704                                                             return and vscnt(0) if
9705                                                             atomic with no-return.
9706                                                           - Must happen before
9707                                                             the following
9708                                                             buffer_gl0_inv.
9709                                                           - Ensures any
9710                                                             following global
9711                                                             data read is no
9712                                                             older than the
9713                                                             atomicrmw value
9714                                                             being acquired.
9715
9716                                                         4. buffer_gl0_inv
9717
9718                                                           - If CU wavefront execution
9719                                                             mode, omit.
9720                                                           - Ensures that
9721                                                             following
9722                                                             loads will not see
9723                                                             stale data.
9724
9725     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9726
9727                                                           - If CU wavefront execution
9728                                                             mode, omit.
9729                                                           - If OpenCL, omit.
9730                                                           - Could be split into
9731                                                             separate s_waitcnt
9732                                                             vmcnt(0) and s_waitcnt
9733                                                             vscnt(0) to allow
9734                                                             them to be
9735                                                             independently moved
9736                                                             according to the
9737                                                             following rules.
9738                                                           - s_waitcnt vmcnt(0)
9739                                                             must happen after
9740                                                             any preceding
9741                                                             global/generic load/load
9742                                                             atomic/
9743                                                             atomicrmw-with-return-value.
9744                                                           - s_waitcnt vscnt(0)
9745                                                             must happen after
9746                                                             any preceding
9747                                                             global/generic
9748                                                             store/store atomic/
9749                                                             atomicrmw-no-return-value.
9750                                                           - Must happen before
9751                                                             the following
9752                                                             atomicrmw.
9753                                                           - Ensures that all
9754                                                             global memory
9755                                                             operations have
9756                                                             completed before
9757                                                             performing the
9758                                                             atomicrmw that is
9759                                                             being released.
9760
9761                                                         2. ds_atomic
9762                                                         3. s_waitcnt lgkmcnt(0)
9763
9764                                                           - If OpenCL, omit.
9765                                                           - Must happen before
9766                                                             the following
9767                                                             buffer_gl0_inv.
9768                                                           - Ensures any
9769                                                             following global
9770                                                             data read is no
9771                                                             older than the local
9772                                                             atomicrmw value being
9773                                                             acquired.
9774
9775                                                         4. buffer_gl0_inv
9776
9777                                                           - If CU wavefront execution
9778                                                             mode, omit.
9779                                                           - If OpenCL, omit.
9780                                                           - Ensures that
9781                                                             following
9782                                                             loads will not see
9783                                                             stale data.
9784
9785     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
9786                                                            vmcnt(0) & vscnt(0)
9787
9788                                                           - If CU wavefront execution
9789                                                             mode, omit vmcnt(0) and
9790                                                             vscnt(0).
9791                                                           - If OpenCL, omit lgkmcnt(0).
9792                                                           - Could be split into
9793                                                             separate s_waitcnt
9794                                                             vmcnt(0), s_waitcnt
9795                                                             vscnt(0) and s_waitcnt
9796                                                             lgkmcnt(0) to allow
9797                                                             them to be
9798                                                             independently moved
9799                                                             according to the
9800                                                             following rules.
9801                                                           - s_waitcnt vmcnt(0)
9802                                                             must happen after
9803                                                             any preceding
9804                                                             global/generic load/load
9805                                                             atomic/
9806                                                             atomicrmw-with-return-value.
9807                                                           - s_waitcnt vscnt(0)
9808                                                             must happen after
9809                                                             any preceding
9810                                                             global/generic
9811                                                             store/store
9812                                                             atomic/
9813                                                             atomicrmw-no-return-value.
9814                                                           - s_waitcnt lgkmcnt(0)
9815                                                             must happen after
9816                                                             any preceding
9817                                                             local/generic
9818                                                             load/store/load
9819                                                             atomic/store
9820                                                             atomic/atomicrmw.
9821                                                           - Must happen before
9822                                                             the following
9823                                                             atomicrmw.
9824                                                           - Ensures that all
9825                                                             memory operations
9826                                                             have
9827                                                             completed before
9828                                                             performing the
9829                                                             atomicrmw that is
9830                                                             being released.
9831
9832                                                         2. flat_atomic
9833                                                         3. s_waitcnt lgkmcnt(0) &
9834                                                            vmcnt(0) & vscnt(0)
9835
9836                                                           - If CU wavefront execution
9837                                                             mode, omit vmcnt(0) and
9838                                                             vscnt(0).
9839                                                           - If OpenCL, omit lgkmcnt(0).
9840                                                           - Must happen before
9841                                                             the following
9842                                                             buffer_gl0_inv.
9843                                                           - Ensures any
9844                                                             following global
9845                                                             data read is no
9846                                                             older than the
9847                                                             atomicrmw value being
9848                                                             acquired.
9849
9850                                                         4. buffer_gl0_inv
9851
9852                                                           - If CU wavefront execution
9853                                                             mode, omit.
9854                                                           - Ensures that
9855                                                             following
9856                                                             loads will not see
9857                                                             stale data.
9858
9859     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9860                               - system                     vmcnt(0) & vscnt(0)
9861
9862                                                           - If OpenCL, omit
9863                                                             lgkmcnt(0).
9864                                                           - Could be split into
9865                                                             separate s_waitcnt
9866                                                             vmcnt(0), s_waitcnt
9867                                                             vscnt(0) and s_waitcnt
9868                                                             lgkmcnt(0) to allow
9869                                                             them to be
9870                                                             independently moved
9871                                                             according to the
9872                                                             following rules.
9873                                                           - s_waitcnt vmcnt(0)
9874                                                             must happen after
9875                                                             any preceding
9876                                                             global/generic
9877                                                             load/load atomic/
9878                                                             atomicrmw-with-return-value.
9879                                                           - s_waitcnt vscnt(0)
9880                                                             must happen after
9881                                                             any preceding
9882                                                             global/generic
9883                                                             store/store atomic/
9884                                                             atomicrmw-no-return-value.
9885                                                           - s_waitcnt lgkmcnt(0)
9886                                                             must happen after
9887                                                             any preceding
9888                                                             local/generic
9889                                                             load/store/load
9890                                                             atomic/store
9891                                                             atomic/atomicrmw.
9892                                                           - Must happen before
9893                                                             the following
9894                                                             atomicrmw.
9895                                                           - Ensures that all
9896                                                             memory operations
9897                                                             to global have
9898                                                             completed before
9899                                                             performing the
9900                                                             atomicrmw that is
9901                                                             being released.
9902
9903                                                         2. buffer/global_atomic
9904                                                         3. s_waitcnt vm/vscnt(0)
9905
9906                                                           - Use vmcnt(0) if atomic with
9907                                                             return and vscnt(0) if
9908                                                             atomic with no-return.
9909                                                           - Must happen before
9910                                                             following
9911                                                             buffer_gl*_inv.
9912                                                           - Ensures the
9913                                                             atomicrmw has
9914                                                             completed before
9915                                                             invalidating the
9916                                                             caches.
9917
9918                                                         4. buffer_gl0_inv;
9919                                                            buffer_gl1_inv
9920
9921                                                           - Must happen before
9922                                                             any following
9923                                                             global/generic
9924                                                             load/load
9925                                                             atomic/atomicrmw.
9926                                                           - Ensures that
9927                                                             following loads
9928                                                             will not see stale
9929                                                             global data.
9930
9931     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
9932                               - system                     vmcnt(0) & vscnt(0)
9933
9934                                                           - If OpenCL, omit
9935                                                             lgkmcnt(0).
9936                                                           - Could be split into
9937                                                             separate s_waitcnt
9938                                                             vmcnt(0), s_waitcnt
9939                                                             vscnt(0), and s_waitcnt
9940                                                             lgkmcnt(0) to allow
9941                                                             them to be
9942                                                             independently moved
9943                                                             according to the
9944                                                             following rules.
9945                                                           - s_waitcnt vmcnt(0)
9946                                                             must happen after
9947                                                             any preceding
9948                                                             global/generic
9949                                                             load/load atomic/
9950                                                             atomicrmw-with-return-value.
9951                                                           - s_waitcnt vscnt(0)
9952                                                             must happen after
9953                                                             any preceding
9954                                                             global/generic
9955                                                             store/store atomic/
9956                                                             atomicrmw-no-return-value.
9957                                                           - s_waitcnt lgkmcnt(0)
9958                                                             must happen after
9959                                                             any preceding
9960                                                             local/generic
9961                                                             load/store/load
9962                                                             atomic/store
9963                                                             atomic/atomicrmw.
9964                                                           - Must happen before
9965                                                             the following
9966                                                             atomicrmw.
9967                                                           - Ensures that all
9968                                                             memory operations
9969                                                             have
9970                                                             completed before
9971                                                             performing the
9972                                                             atomicrmw that is
9973                                                             being released.
9974
9975                                                         2. flat_atomic
9976                                                         3. s_waitcnt vm/vscnt(0) &
9977                                                            lgkmcnt(0)
9978
9979                                                           - If OpenCL, omit
9980                                                             lgkmcnt(0).
9981                                                           - Use vmcnt(0) if atomic with
9982                                                             return and vscnt(0) if
9983                                                             atomic with no-return.
9984                                                           - Must happen before
9985                                                             following
9986                                                             buffer_gl*_inv.
9987                                                           - Ensures the
9988                                                             atomicrmw has
9989                                                             completed before
9990                                                             invalidating the
9991                                                             caches.
9992
9993                                                         4. buffer_gl0_inv;
9994                                                            buffer_gl1_inv
9995
9996                                                           - Must happen before
9997                                                             any following
9998                                                             global/generic
9999                                                             load/load
10000                                                             atomic/atomicrmw.
10001                                                           - Ensures that
10002                                                             following loads
10003                                                             will not see stale
10004                                                             global data.
10005
10006     fence        acq_rel      - singlethread *none*     *none*
10007                               - wavefront
10008     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
10009                                                            vmcnt(0) & vscnt(0)
10010
10011                                                           - If CU wavefront execution
10012                                                             mode, omit vmcnt(0) and
10013                                                             vscnt(0).
10014                                                           - If OpenCL and
10015                                                             address space is
10016                                                             not generic, omit
10017                                                             lgkmcnt(0).
10018                                                           - If OpenCL and
10019                                                             address space is
10020                                                             local, omit
10021                                                             vmcnt(0) and vscnt(0).
10022                                                           - However,
10023                                                             since LLVM
10024                                                             currently has no
10025                                                             address space on
10026                                                             the fence, the full s_waitcnt
10027                                                             must conservatively
10028                                                             always be generated
10029                                                             (see comment for
10030                                                             previous fence).
10031                                                           - Could be split into
10032                                                             separate s_waitcnt
10033                                                             vmcnt(0), s_waitcnt
10034                                                             vscnt(0) and s_waitcnt
10035                                                             lgkmcnt(0) to allow
10036                                                             them to be
10037                                                             independently moved
10038                                                             according to the
10039                                                             following rules.
10040                                                           - s_waitcnt vmcnt(0)
10041                                                             must happen after
10042                                                             any preceding
10043                                                             global/generic
10044                                                             load/load
10045                                                             atomic/
10046                                                             atomicrmw-with-return-value.
10047                                                           - s_waitcnt vscnt(0)
10048                                                             must happen after
10049                                                             any preceding
10050                                                             global/generic
10051                                                             store/store atomic/
10052                                                             atomicrmw-no-return-value.
10053                                                           - s_waitcnt lgkmcnt(0)
10054                                                             must happen after
10055                                                             any preceding
10056                                                             local/generic
10057                                                             load/store/load
10058                                                             atomic/store atomic/
10059                                                             atomicrmw.
10060                                                           - Must happen before
10061                                                             any following
10062                                                             global/generic
10063                                                             load/load
10064                                                             atomic/store/store
10065                                                             atomic/atomicrmw.
10066                                                           - Ensures that all
10067                                                             memory operations
10068                                                             have
10069                                                             completed before
10070                                                             performing any
10071                                                             following global
10072                                                             memory operations.
10073                                                           - Ensures that the
10074                                                             preceding
10075                                                             local/generic load
10076                                                             atomic/atomicrmw
10077                                                             with an equal or
10078                                                             wider sync scope
10079                                                             and memory ordering
10080                                                             stronger than
10081                                                             unordered (this is
10082                                                             termed the
10083                                                             acquire-fence-paired-atomic)
10084                                                             has completed
10085                                                             before following
10086                                                             global memory
10087                                                             operations. This
10088                                                             satisfies the
10089                                                             requirements of
10090                                                             acquire.
10091                                                           - Ensures that all
10092                                                             previous memory
10093                                                             operations have
10094                                                             completed before a
10095                                                             following
10096                                                             local/generic store
10097                                                             atomic/atomicrmw
10098                                                             with an equal or
10099                                                             wider sync scope
10100                                                             and memory ordering
10101                                                             stronger than
10102                                                             unordered (this is
10103                                                             termed the
10104                                                             release-fence-paired-atomic).
10105                                                             This satisfies the
10106                                                             requirements of
10107                                                             release.
10108                                                           - Must happen before
10109                                                             the following
10110                                                             buffer_gl0_inv.
10111                                                           - Ensures that the
10112                                                             acquire-fence-paired-atomic
10113                                                             has completed
10114                                                             before invalidating
10115                                                             the
10116                                                             cache. Therefore
10117                                                             any following
10118                                                             locations read must
10119                                                             be no older than
10120                                                             the value read by
10121                                                             the
10122                                                             acquire-fence-paired-atomic.
10123
10124                                                         2. buffer_gl0_inv
10125
10126                                                           - If CU wavefront execution
10127                                                             mode, omit.
10128                                                           - Ensures that
10129                                                             following
10130                                                             loads will not see
10131                                                             stale data.
10132
10133     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
10134                               - system                     vmcnt(0) & vscnt(0)
10135
10136                                                           - If OpenCL and
10137                                                             address space is
10138                                                             not generic, omit
10139                                                             lgkmcnt(0).
10140                                                           - If OpenCL and
10141                                                             address space is
10142                                                             local, omit
10143                                                             vmcnt(0) and vscnt(0).
10144                                                           - However, since LLVM
10145                                                             currently has no
10146                                                             address space on
10147                                                             the fence, the full s_waitcnt
10148                                                             must conservatively
10149                                                             always be generated
10150                                                             (see comment for
10151                                                             previous fence).
10152                                                           - Could be split into
10153                                                             separate s_waitcnt
10154                                                             vmcnt(0), s_waitcnt
10155                                                             vscnt(0) and s_waitcnt
10156                                                             lgkmcnt(0) to allow
10157                                                             them to be
10158                                                             independently moved
10159                                                             according to the
10160                                                             following rules.
10161                                                           - s_waitcnt vmcnt(0)
10162                                                             must happen after
10163                                                             any preceding
10164                                                             global/generic
10165                                                             load/load
10166                                                             atomic/
10167                                                             atomicrmw-with-return-value.
10168                                                           - s_waitcnt vscnt(0)
10169                                                             must happen after
10170                                                             any preceding
10171                                                             global/generic
10172                                                             store/store atomic/
10173                                                             atomicrmw-no-return-value.
10174                                                           - s_waitcnt lgkmcnt(0)
10175                                                             must happen after
10176                                                             any preceding
10177                                                             local/generic
10178                                                             load/store/load
10179                                                             atomic/store
10180                                                             atomic/atomicrmw.
10181                                                           - Must happen before
10182                                                             the following
10183                                                             buffer_gl*_inv.
10184                                                           - Ensures that the
10185                                                             preceding
10186                                                             global/local/generic
10187                                                             load
10188                                                             atomic/atomicrmw
10189                                                             with an equal or
10190                                                             wider sync scope
10191                                                             and memory ordering
10192                                                             stronger than
10193                                                             unordered (this is
10194                                                             termed the
10195                                                             acquire-fence-paired-atomic)
10196                                                             has completed
10197                                                             before invalidating
10198                                                             the caches. This
10199                                                             satisfies the
10200                                                             requirements of
10201                                                             acquire.
10202                                                           - Ensures that all
10203                                                             previous memory
10204                                                             operations have
10205                                                             completed before a
10206                                                             following
10207                                                             global/local/generic
10208                                                             store
10209                                                             atomic/atomicrmw
10210                                                             with an equal or
10211                                                             wider sync scope
10212                                                             and memory ordering
10213                                                             stronger than
10214                                                             unordered (this is
10215                                                             termed the
10216                                                             release-fence-paired-atomic).
10217                                                             This satisfies the
10218                                                             requirements of
10219                                                             release.
10220
10221                                                         2. buffer_gl0_inv;
10222                                                            buffer_gl1_inv
10223
10224                                                           - Must happen before
10225                                                             any following
10226                                                             global/generic
10227                                                             load/load
10228                                                             atomic/store/store
10229                                                             atomic/atomicrmw.
10230                                                           - Ensures that
10231                                                             following loads
10232                                                             will not see stale
10233                                                             global data. This
10234                                                             satisfies the
10235                                                             requirements of
10236                                                             acquire.
10237
10238     **Sequentially Consistent Atomic**
10239     ------------------------------------------------------------------------------------
10240     load atomic  seq_cst      - singlethread - global   *Same as corresponding
10241                               - wavefront    - local    load atomic acquire,
10242                                              - generic  except must generate
10243                                                         all instructions even
10244                                                         for OpenCL.*
10245     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
10246                                              - generic     vmcnt(0) & vscnt(0)
10247
10248                                                           - If CU wavefront execution
10249                                                             mode, omit vmcnt(0) and
10250                                                             vscnt(0).
10251                                                           - Could be split into
10252                                                             separate s_waitcnt
10253                                                             vmcnt(0), s_waitcnt
10254                                                             vscnt(0), and s_waitcnt
10255                                                             lgkmcnt(0) to allow
10256                                                             them to be
10257                                                             independently moved
10258                                                             according to the
10259                                                             following rules.
10260                                                           - s_waitcnt lgkmcnt(0) must
10261                                                             happen after
10262                                                             preceding
10263                                                             local/generic load
10264                                                             atomic/store
10265                                                             atomic/atomicrmw
10266                                                             with memory
10267                                                             ordering of seq_cst
10268                                                             and with equal or
10269                                                             wider sync scope.
10270                                                             (Note that seq_cst
10271                                                             fences have their
10272                                                             own s_waitcnt
10273                                                             lgkmcnt(0) and so do
10274                                                             not need to be
10275                                                             considered.)
10276                                                           - s_waitcnt vmcnt(0)
10277                                                             must happen after
10278                                                             preceding
10279                                                             global/generic load
10280                                                             atomic/
10281                                                             atomicrmw-with-return-value
10282                                                             with memory
10283                                                             ordering of seq_cst
10284                                                             and with equal or
10285                                                             wider sync scope.
10286                                                             (Note that seq_cst
10287                                                             fences have their
10288                                                             own s_waitcnt
10289                                                             vmcnt(0) and so do
10290                                                             not need to be
10291                                                             considered.)
10292                                                           - s_waitcnt vscnt(0)
10293                                                             must happen after
10294                                                             preceding
10295                                                             global/generic store
10296                                                             atomic/
10297                                                             atomicrmw-no-return-value
10298                                                             with memory
10299                                                             ordering of seq_cst
10300                                                             and with equal or
10301                                                             wider sync scope.
10302                                                             (Note that seq_cst
10303                                                             fences have their
10304                                                             own s_waitcnt
10305                                                             vscnt(0) and so do
10306                                                             not need to be
10307                                                             considered.)
10308                                                           - Ensures any
10309                                                             preceding
10310                                                             sequentially
10311                                                             consistent global/local
10312                                                             memory instructions
10313                                                             have completed
10314                                                             before executing
10315                                                             this sequentially
10316                                                             consistent
10317                                                             instruction. This
10318                                                             prevents reordering
10319                                                             a seq_cst store
10320                                                             followed by a
10321                                                             seq_cst load. (Note
10322                                                             that seq_cst is
10323                                                             stronger than
10324                                                             acquire/release as
10325                                                             the reordering of
10326                                                             load acquire
10327                                                             followed by a store
10328                                                             release is
10329                                                             prevented by the
10330                                                             s_waitcnt of
10331                                                             the release, but
10332                                                             there is nothing
10333                                                             preventing a store
10334                                                             release followed by
10335                                                             load acquire from
10336                                                             completing out of
10337                                                             order. The s_waitcnt
10338                                                             could be placed after
10339                                                             the seq_cst store or before
10340                                                             the seq_cst load. We
10341                                                             choose the load to
10342                                                             make the s_waitcnt be
10343                                                             as late as possible
10344                                                             so that the store
10345                                                             may have already
10346                                                             completed.)
10347
10348                                                         2. *Following
10349                                                            instructions same as
10350                                                            corresponding load
10351                                                            atomic acquire,
10352                                                            except must generate
10353                                                            all instructions even
10354                                                            for OpenCL.*
10355     load atomic  seq_cst      - workgroup    - local
10356
10357                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
10358
10359                                                           - If CU wavefront execution
10360                                                             mode, omit.
10361                                                           - Could be split into
10362                                                             separate s_waitcnt
10363                                                             vmcnt(0) and s_waitcnt
10364                                                             vscnt(0) to allow
10365                                                             them to be
10366                                                             independently moved
10367                                                             according to the
10368                                                             following rules.
10369                                                           - s_waitcnt vmcnt(0)
10370                                                             must happen after
10371                                                             preceding
10372                                                             global/generic load
10373                                                             atomic/
10374                                                             atomicrmw-with-return-value
10375                                                             with memory
10376                                                             ordering of seq_cst
10377                                                             and with equal or
10378                                                             wider sync scope.
10379                                                             (Note that seq_cst
10380                                                             fences have their
10381                                                             own s_waitcnt
10382                                                             vmcnt(0) and so do
10383                                                             not need to be
10384                                                             considered.)
10385                                                           - s_waitcnt vscnt(0)
10386                                                             must happen after
10387                                                             preceding
10388                                                             global/generic store
10389                                                             atomic/
10390                                                             atomicrmw-no-return-value
10391                                                             with memory
10392                                                             ordering of seq_cst
10393                                                             and with equal or
10394                                                             wider sync scope.
10395                                                             (Note that seq_cst
10396                                                             fences have their
10397                                                             own s_waitcnt
10398                                                             vscnt(0) and so do
10399                                                             not need to be
10400                                                             considered.)
10401                                                           - Ensures any
10402                                                             preceding
10403                                                             sequentially
10404                                                             consistent global
10405                                                             memory instructions
10406                                                             have completed
10407                                                             before executing
10408                                                             this sequentially
10409                                                             consistent
10410                                                             instruction. This
10411                                                             prevents reordering
10412                                                             a seq_cst store
10413                                                             followed by a
10414                                                             seq_cst load. (Note
10415                                                             that seq_cst is
10416                                                             stronger than
10417                                                             acquire/release as
10418                                                             the reordering of
10419                                                             load acquire
10420                                                             followed by a store
10421                                                             release is
10422                                                             prevented by the
10423                                                             s_waitcnt of
10424                                                             the release, but
10425                                                             there is nothing
10426                                                             preventing a store
10427                                                             release followed by
10428                                                             load acquire from
10429                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We choose the
                                                             load to make the
                                                             s_waitcnt be as late
                                                             as possible so that
                                                             the store may have
                                                             already completed.)
10440
10441                                                         2. *Following
10442                                                            instructions same as
10443                                                            corresponding load
10444                                                            atomic acquire,
10445                                                            except must generate
10446                                                            all instructions even
10447                                                            for OpenCL.*
10448
10449     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
10450                               - system       - generic     vmcnt(0) & vscnt(0)
10451
10452                                                           - Could be split into
10453                                                             separate s_waitcnt
10454                                                             vmcnt(0), s_waitcnt
10455                                                             vscnt(0) and s_waitcnt
10456                                                             lgkmcnt(0) to allow
10457                                                             them to be
10458                                                             independently moved
10459                                                             according to the
10460                                                             following rules.
10461                                                           - s_waitcnt lgkmcnt(0)
10462                                                             must happen after
10463                                                             preceding
10464                                                             local load
10465                                                             atomic/store
10466                                                             atomic/atomicrmw
10467                                                             with memory
10468                                                             ordering of seq_cst
10469                                                             and with equal or
10470                                                             wider sync scope.
10471                                                             (Note that seq_cst
10472                                                             fences have their
10473                                                             own s_waitcnt
10474                                                             lgkmcnt(0) and so do
10475                                                             not need to be
10476                                                             considered.)
10477                                                           - s_waitcnt vmcnt(0)
10478                                                             must happen after
10479                                                             preceding
10480                                                             global/generic load
10481                                                             atomic/
10482                                                             atomicrmw-with-return-value
10483                                                             with memory
10484                                                             ordering of seq_cst
10485                                                             and with equal or
10486                                                             wider sync scope.
10487                                                             (Note that seq_cst
10488                                                             fences have their
10489                                                             own s_waitcnt
10490                                                             vmcnt(0) and so do
10491                                                             not need to be
10492                                                             considered.)
10493                                                           - s_waitcnt vscnt(0)
                                                             must happen after
10495                                                             preceding
10496                                                             global/generic store
10497                                                             atomic/
10498                                                             atomicrmw-no-return-value
10499                                                             with memory
10500                                                             ordering of seq_cst
10501                                                             and with equal or
10502                                                             wider sync scope.
10503                                                             (Note that seq_cst
10504                                                             fences have their
10505                                                             own s_waitcnt
10506                                                             vscnt(0) and so do
10507                                                             not need to be
10508                                                             considered.)
10509                                                           - Ensures any
10510                                                             preceding
                                                             sequentially
10512                                                             consistent global
10513                                                             memory instructions
10514                                                             have completed
10515                                                             before executing
10516                                                             this sequentially
10517                                                             consistent
10518                                                             instruction. This
10519                                                             prevents reordering
10520                                                             a seq_cst store
10521                                                             followed by a
10522                                                             seq_cst load. (Note
10523                                                             that seq_cst is
10524                                                             stronger than
10525                                                             acquire/release as
10526                                                             the reordering of
10527                                                             load acquire
10528                                                             followed by a store
10529                                                             release is
10530                                                             prevented by the
10531                                                             s_waitcnt of
10532                                                             the release, but
10533                                                             there is nothing
10534                                                             preventing a store
10535                                                             release followed by
10536                                                             load acquire from
10537                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We choose the
                                                             load to make the
                                                             s_waitcnt be as late
                                                             as possible so that
                                                             the store may have
                                                             already completed.)
10548
10549                                                         2. *Following
10550                                                            instructions same as
10551                                                            corresponding load
10552                                                            atomic acquire,
10553                                                            except must generate
10554                                                            all instructions even
10555                                                            for OpenCL.*
10556     store atomic seq_cst      - singlethread - global   *Same as corresponding
10557                               - wavefront    - local    store atomic release,
10558                               - workgroup    - generic  except must generate
10559                               - agent                   all instructions even
10560                               - system                  for OpenCL.*
10561     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
10562                               - wavefront    - local    atomicrmw acq_rel,
10563                               - workgroup    - generic  except must generate
10564                               - agent                   all instructions even
10565                               - system                  for OpenCL.*
10566     fence        seq_cst      - singlethread *none*     *Same as corresponding
10567                               - wavefront               fence acq_rel,
10568                               - workgroup               except must generate
10569                               - agent                   all instructions even
10570                               - system                  for OpenCL.*
10571     ============ ============ ============== ========== ================================
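
The ordering rules in the ``load atomic seq_cst`` row above can be read as a
mapping from each kind of preceding ``seq_cst`` operation to the counter that
must reach zero before the load. The following Python sketch illustrates that
reading only. The function name and the tuple encoding of operations are
hypothetical and are not part of the backend, which emits the combined
``s_waitcnt`` shown in the table.

.. code-block:: python

  def waits_before_seq_cst_load(preceding_ops):
      """Return the wait counters implied by earlier seq_cst operations.

      preceding_ops is a hypothetical list of (address_space, kind) tuples
      for earlier seq_cst atomics of equal or wider sync scope. kind is one
      of "load", "store", "rmw_ret" or "rmw_noret".
      """
      counters = set()
      for address_space, kind in preceding_ops:
          if address_space == "local":
              # Local load atomic/store atomic/atomicrmw.
              counters.add("lgkmcnt(0)")
          elif kind in ("load", "rmw_ret"):
              # Global/generic load atomic or atomicrmw-with-return-value.
              counters.add("vmcnt(0)")
          else:
              # Global/generic store atomic or atomicrmw-no-return-value.
              counters.add("vscnt(0)")
      return sorted(counters)

  # A preceding global seq_cst store only needs the store counter.
  print(waits_before_seq_cst_load([("global", "store")]))

As the table notes, ``seq_cst`` fences carry their own waits and so would not
appear in ``preceding_ops``.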
10572
10573Trap Handler ABI
10574~~~~~~~~~~~~~~~~
10575
10576For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
10577runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
10578supports the ``s_trap`` instruction. For usage see:
10579
10580- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
10581- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
10582- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`
10583
10584  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
10585     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
10586
10587     =================== =============== =============== =======================================
10588     Usage               Code Sequence   Trap Handler    Description
10589                                         Inputs
10590     =================== =============== =============== =======================================
10591     reserved            ``s_trap 0x00``                 Reserved by hardware.
10592     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
10593                                           ``queue_ptr`` intrinsic (not implemented).
10594                                         ``VGPR0``:
10595                                           ``arg``
10596     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
10597                                           ``queue_ptr`` the trap instruction. The associated
10598                                                         queue is signalled to put it into the
10599                                                         error state.  When the queue is put in
10600                                                         the error state, the waves executing
10601                                                         dispatches on the queue will be
10602                                                         terminated.
10603     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
10604                                                           as a no-operation. The trap handler
10605                                                           is entered and immediately returns to
10606                                                           continue execution of the wavefront.
10607                                                         - If the debugger is enabled, causes
10608                                                           the debug trap to be reported by the
10609                                                           debugger and the wavefront is put in
10610                                                           the halt state with the PC at the
10611                                                           instruction.  The debugger must
10612                                                           increment the PC and resume the wave.
10613     reserved            ``s_trap 0x04``                 Reserved.
10614     reserved            ``s_trap 0x05``                 Reserved.
10615     reserved            ``s_trap 0x06``                 Reserved.
10616     reserved            ``s_trap 0x07``                 Reserved.
10617     reserved            ``s_trap 0x08``                 Reserved.
10618     reserved            ``s_trap 0xfe``                 Reserved.
10619     reserved            ``s_trap 0xff``                 Reserved.
10620     =================== =============== =============== =======================================
10621
10622..
10623
10624  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
10625     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
10626
10627     =================== =============== =============== =======================================
10628     Usage               Code Sequence   Trap Handler    Description
10629                                         Inputs
10630     =================== =============== =============== =======================================
10631     reserved            ``s_trap 0x00``                 Reserved by hardware.
10632     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
10633                                                         breakpoints. Causes wave to be halted
10634                                                         with the PC at the trap instruction.
10635                                                         The debugger is responsible to resume
10636                                                         the wave, including the instruction
10637                                                         that the breakpoint overwrote.
10638     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
10639                                           ``queue_ptr`` the trap instruction. The associated
10640                                                         queue is signalled to put it into the
10641                                                         error state.  When the queue is put in
10642                                                         the error state, the waves executing
10643                                                         dispatches on the queue will be
10644                                                         terminated.
10645     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
10646                                                           as a no-operation. The trap handler
10647                                                           is entered and immediately returns to
10648                                                           continue execution of the wavefront.
10649                                                         - If the debugger is enabled, causes
10650                                                           the debug trap to be reported by the
10651                                                           debugger and the wavefront is put in
10652                                                           the halt state with the PC at the
10653                                                           instruction.  The debugger must
10654                                                           increment the PC and resume the wave.
10655     reserved            ``s_trap 0x04``                 Reserved.
10656     reserved            ``s_trap 0x05``                 Reserved.
10657     reserved            ``s_trap 0x06``                 Reserved.
10658     reserved            ``s_trap 0x07``                 Reserved.
10659     reserved            ``s_trap 0x08``                 Reserved.
10660     reserved            ``s_trap 0xfe``                 Reserved.
10661     reserved            ``s_trap 0xff``                 Reserved.
10662     =================== =============== =============== =======================================
10663
10664..
10665
10666  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
10667     :name: amdgpu-trap-handler-for-amdhsa-os-v4-table
10668
10669     =================== =============== ================ ================= =======================================
10670     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
10671     =================== =============== ================ ================= =======================================
10672     reserved            ``s_trap 0x00``                                    Reserved by hardware.
10673     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
10674                                                                            breakpoints. Causes wave to be halted
10675                                                                            with the PC at the trap instruction.
10676                                                                            The debugger is responsible to resume
10677                                                                            the wave, including the instruction
10678                                                                            that the breakpoint overwrote.
10679     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
10680                                           ``queue_ptr``                    the trap instruction. The associated
10681                                                                            queue is signalled to put it into the
10682                                                                            error state.  When the queue is put in
10683                                                                            the error state, the waves executing
10684                                                                            dispatches on the queue will be
10685                                                                            terminated.
10686     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
10687                                                                              as a no-operation. The trap handler
10688                                                                              is entered and immediately returns to
10689                                                                              continue execution of the wavefront.
10690                                                                            - If the debugger is enabled, causes
10691                                                                              the debug trap to be reported by the
10692                                                                              debugger and the wavefront is put in
10693                                                                              the halt state with the PC at the
10694                                                                              instruction.  The debugger must
10695                                                                              increment the PC and resume the wave.
10696     reserved            ``s_trap 0x04``                                    Reserved.
10697     reserved            ``s_trap 0x05``                                    Reserved.
10698     reserved            ``s_trap 0x06``                                    Reserved.
10699     reserved            ``s_trap 0x07``                                    Reserved.
10700     reserved            ``s_trap 0x08``                                    Reserved.
10701     reserved            ``s_trap 0xfe``                                    Reserved.
10702     reserved            ``s_trap 0xff``                                    Reserved.
10703     =================== =============== ================ ================= =======================================
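
For ``llvm.trap`` (``s_trap 0x02``), the three tables differ only in the trap
handler inputs. The Python sketch below summarizes that difference. The
function name and its integer parameters are assumptions made for
illustration and do not correspond to any runtime or backend API.

.. code-block:: python

  def llvm_trap_inputs(code_object_version, gfx_major):
      """Return the trap handler inputs for llvm.trap (s_trap 0x02)."""
      if code_object_version not in (2, 3, 4):
          raise ValueError("only code object V2, V3 and V4 are tabulated")
      if not 6 <= gfx_major <= 10:
          raise ValueError("only GFX6-GFX10 are tabulated")
      if code_object_version == 4 and gfx_major >= 9:
          # Code object V4 on GFX9-GFX10 lists no inputs.
          return {}
      # V2, V3, and V4 on GFX6-GFX8 pass queue_ptr in SGPR0-1.
      return {"SGPR0-1": "queue_ptr"}

  print(llvm_trap_inputs(4, 8))
  print(llvm_trap_inputs(4, 10))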
10704
10705.. _amdgpu-amdhsa-function-call-convention:
10706
10707Call Convention
10708~~~~~~~~~~~~~~~
10709
10710.. note::
10711
  This section is currently incomplete and has inaccuracies. It is a work in
  progress that will be updated as information is determined.
10714
10715See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
10716addresses. Unswizzled addresses are normal linear addresses.
10717
10718.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
10719
10720Kernel Functions
10721++++++++++++++++
10722
10723This section describes the call convention ABI for the outer kernel function.
10724
10725See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
10726convention.
10727
The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU backend implements function calls:
10730
107311.  Clang decides the kernarg layout to match the *HSA Programmer's Language
10732    Reference* [HSA]_.
10733
10734    - All structs are passed directly.
10735    - Lambda values are passed *TBA*.
10736
10737    .. TODO::
10738
      - Does this really follow HSA rules? Or are structs >16 bytes passed
        as by-value structs?
      - What is the ABI for lambda values?
10742
2.  The kernel performs certain setup in its prolog, as described in
10744    :ref:`amdgpu-amdhsa-kernel-prolog`.
10745
10746.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
10747
10748Non-Kernel Functions
10749++++++++++++++++++++
10750
10751This section describes the call convention ABI for functions other than the
10752outer kernel function.
10753
If a kernel has function calls, then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.
10757
10758On entry to a function:
10759
107601.  SGPR0-3 contain a V# with the following properties (see
10761    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
10762
10763    * Base address pointing to the beginning of the wavefront scratch backing
10764      memory.
10765    * Swizzled with dword element size and stride of wavefront size elements.
10766
2.  The FLAT_SCRATCH register pair is set up. See
    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3.  GFX6-GFX8: The M0 register is set to the size of LDS in bytes. See
    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
107714.  The EXEC register is set to the lanes active on entry to the function.
107725.  MODE register: *TBD*
107736.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
10774    below.
7.  SGPR30-31 contain the return address (RA): the code address that the
    function must return to when it completes. The value is undefined if the
    function is *no return*.
107788.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
10779    offset relative to the beginning of the wavefront scratch backing memory.
10780
10781    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
10782    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
10783    manner.
10784
10785    The unswizzled SP value can be converted into the swizzled SP value by:
10786
10787      | swizzled SP = unswizzled SP / wavefront size
10788
10789    This may be used to obtain the private address space address of stack
10790    objects and to convert this address to a flat address by adding the flat
10791    scratch aperture base address.
10792
    The swizzled SP value is always 4 byte aligned for the ``r600``
    architecture and 16 byte aligned for the ``amdgcn`` architecture.
10795
10796    .. note::
10797
      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
      OpenCL language, whose largest base type is defined as 16 bytes.
10800
10801    On entry, the swizzled SP value is the address of the first function
10802    argument passed on the stack. Other stack passed arguments are positive
10803    offsets from the entry swizzled SP value.
10804
10805    The function may use positive offsets beyond the last stack passed argument
10806    for stack allocated local variables and register spill slots. If necessary,
10807    the function may align these to greater alignment than 16 bytes. After these
10808    the function may dynamically allocate space for such things as runtime sized
10809    ``alloca`` local allocations.
10810
10811    If the function calls another function, it will place any stack allocated
10812    arguments after the last local allocation and adjust SGPR32 to the address
10813    after the last local allocation.
10814
108159.  All other registers are unspecified.
1081610. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
10817    to the function.
10818
10819On exit from a function:
10820
108211.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
10822    described below. Any registers used are considered clobbered registers.
108232.  The following registers are preserved and have the same value as on entry:
10824
10825    * FLAT_SCRATCH
10826    * EXEC
10827    * GFX6-GFX8: M0
10828    * All SGPR registers except the clobbered registers of SGPR4-31.
10829    * VGPR40-47
10830    * VGPR56-63
10831    * VGPR72-79
10832    * VGPR88-95
10833    * VGPR104-111
10834    * VGPR120-127
10835    * VGPR136-143
10836    * VGPR152-159
10837    * VGPR168-175
10838    * VGPR184-191
10839    * VGPR200-207
10840    * VGPR216-223
10841    * VGPR232-239
10842    * VGPR248-255
10843
10844        .. note::
10845
          Except for the argument registers, the clobbered and preserved VGPRs
          are intermixed at regular intervals in order to keep a similar ratio
          independent of the number of allocated VGPRs.
10849
10850    * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
10851    * Lanes of all VGPRs that are inactive at the call site.
10852
      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
      optimization may mark some of the clobbered SGPR and VGPR registers as
      preserved if it can be determined that the called function does not change
      their value.
10857
3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
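
The preserved VGPR ranges listed above follow a regular pattern: starting at
VGPR40, the first 8 registers of every group of 16 are preserved. The Python
sketch below encodes that pattern as an illustration. The function name is
hypothetical, and the sketch covers only whole-register preservation, not the
inactive lanes or any IPRA adjustments.

.. code-block:: python

  def is_preserved_vgpr(index):
      """Return True if VGPR<index> is in one of the preserved ranges."""
      if not 0 <= index <= 255:
          raise ValueError("VGPR index must be in the range 0-255")
      # Ranges are VGPR40-47, VGPR56-63, ..., VGPR248-255.
      return index >= 40 and (index - 40) % 16 < 8

  preserved = [i for i in range(256) if is_preserved_vgpr(i)]
  print(preserved[:8], preserved[-8:])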
10863
10864.. TODO::
10865
10866  - How are function results returned? The address of structured types is passed
10867    by reference, but what about other types?
10868
10869The function input arguments are made up of the formal arguments explicitly
10870declared by the source language function plus the implicit input arguments used
10871by the implementation.
10872
10873The source language input arguments are:
10874
108751. Any source language implicit ``this`` or ``self`` argument comes first as a
10876   pointer type.
108772. Followed by the function formal arguments in left to right source order.
10878
10879The source language result arguments are:
10880
108811. The function result argument.
10882
Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if
the called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
10890
Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location,
copying the struct value into it, and passing its address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.
10896
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and for passing the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
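The size rule in the preceding paragraphs can be summarized with a minimal
sketch. The function names, the string labels, and the nested-list model of a
struct are assumptions made for illustration; they are not part of Clang or the
AMDGPU backend, and the open questions about ``sret`` noted in this section
still apply.

```python
# Hypothetical model: a struct is a nested list of base type names.
DIRECT_STRUCT_LIMIT_BYTES = 16  # threshold stated in the text above


def classify_struct_argument(size_in_bytes, is_result=False):
    """Return how a struct argument of the given size is passed."""
    if size_in_bytes <= DIRECT_STRUCT_LIMIT_BYTES:
        # Decomposed into base type fields, each passed separately.
        return "direct"
    # Larger structs are passed or returned through a caller-allocated
    # stack location.
    return "sret" if is_result else "byval"


def decompose(fields):
    """Recursively flatten nested struct fields into base type fields."""
    flat = []
    for field in fields:
        if isinstance(field, list):
            flat.extend(decompose(field))
        else:
            flat.append(field)
    return flat


if __name__ == "__main__":
    print(classify_struct_argument(16))
    print(classify_struct_argument(24, is_result=True))
    print(decompose(["i32", ["float", ["i8", "i8"]]]))
```

The sketch only captures the size threshold and the recursive flattening; it
does not model alignment or how each decomposed field is then assigned to a
register or stack location.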
10903
*TODO: correct the* ``sret`` *definition.*
10905
10906.. TODO::
10907
10908  Is this definition correct? Or is ``sret`` only used if passing in registers, and
10909  pass as non-decomposed struct as stack argument? Or something else? Is the
10910  memory location in the caller stack frame, or a stack memory argument and so
10911  no address is passed as the caller can directly write to the argument stack
  location? But then the stack location is still live after return. If it is an
  argument stack location, is it the first stack argument or the last one?
10914
10915Lambda argument types are treated as struct types with an implementation defined
10916set of fields.
10917
10918.. TODO::
10919
10920  Need to specify the ABI for lambda types for AMDGPU.
10921
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
10925
10926The AMDGPU backend walks the function call graph from the leaves to determine
10927which implicit input arguments are used, propagating to each caller of the
10928function. The used implicit arguments are appended to the function arguments
10929after the source language arguments in the following order:
10930
10931.. TODO::
10932
10933  Is recursion or external functions supported?
10934
109351.  Work-Item ID (1 VGPR)
10936
    The X, Y and Z work-item IDs are packed into a single VGPR with the following
    layout. Only fields actually used by the function are set. The other bits
    are undefined.
10940
10941    The values come from the initial kernel execution state. See
10942    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
10943
10944    .. table:: Work-item implicit argument layout
10945      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
10946
10947      ======= ======= ==============
10948      Bits    Size    Field Name
10949      ======= ======= ==============
10950      9:0     10 bits X Work-Item ID
10951      19:10   10 bits Y Work-Item ID
10952      29:20   10 bits Z Work-Item ID
10953      31:30   2 bits  Unused
10954      ======= ======= ==============
10955
109562.  Dispatch Ptr (2 SGPRs)
10957
10958    The value comes from the initial kernel execution state. See
10959    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10960
109613.  Queue Ptr (2 SGPRs)
10962
10963    The value comes from the initial kernel execution state. See
10964    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10965
109664.  Kernarg Segment Ptr (2 SGPRs)
10967
10968    The value comes from the initial kernel execution state. See
10969    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10970
5.  Dispatch ID (2 SGPRs)
10972
10973    The value comes from the initial kernel execution state. See
10974    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10975
109766.  Work-Group ID X (1 SGPR)
10977
10978    The value comes from the initial kernel execution state. See
10979    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10980
109817.  Work-Group ID Y (1 SGPR)
10982
10983    The value comes from the initial kernel execution state. See
10984    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10985
109868.  Work-Group ID Z (1 SGPR)
10987
10988    The value comes from the initial kernel execution state. See
10989    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10990
109919.  Implicit Argument Ptr (2 SGPRs)
10992
10993    The value is computed by adding an offset to Kernarg Segment Ptr to get the
10994    global address space pointer to the first kernarg implicit argument.
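As an illustration of the work-item implicit argument layout in item 1 above,
the following minimal sketch packs and unpacks the three 10-bit fields. The
function names are hypothetical, and the sketch sets all three fields, whereas
the text states that only fields actually used by the function are guaranteed
to be set.

```python
FIELD_BITS = 10
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0x3FF


def pack_work_item_id(x, y, z):
    """Pack X, Y and Z work-item IDs into bits 9:0, 19:10 and 29:20."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("each work-item ID must fit in 10 bits")
    return x | (y << FIELD_BITS) | (z << (2 * FIELD_BITS))


def unpack_work_item_id(vgpr):
    """Extract (x, y, z) from a packed 32-bit value; bits 31:30 are ignored."""
    x = vgpr & FIELD_MASK
    y = (vgpr >> FIELD_BITS) & FIELD_MASK
    z = (vgpr >> (2 * FIELD_BITS)) & FIELD_MASK
    return x, y, z


if __name__ == "__main__":
    packed = pack_work_item_id(5, 3, 1)
    print(hex(packed), unpack_work_item_id(packed))
```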
10995
10996The input and result arguments are assigned in order in the following manner:
10997
10998.. note::
10999
11000  There are likely some errors and omissions in the following description that
11001  need correction.
11002
11003  .. TODO::
11004
11005    Check the Clang source code to decipher how function arguments and return
11006    results are handled. Also see the AMDGPU specific values used.
11007
11008* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
11009  VGPR31.
11010
11011  If there are more arguments than will fit in these registers, the remaining
11012  arguments are allocated on the stack in order on naturally aligned
11013  addresses.
11014
11015  .. TODO::
11016
11017    How are overly aligned structures allocated on the stack?
11018
11019* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
11020  SGPR29.
11021
11022  If there are more arguments than will fit in these registers, the remaining
11023  arguments are allocated on the stack in order on naturally aligned
11024  addresses.
11025
11026Note that decomposed struct type arguments may have some fields passed in
11027registers and some in memory.
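The assignment rules above can be sketched as follows. This is a simplified
model under stated assumptions, not the backend's implementation: every
argument is assumed to be a single 32-bit value, stack slots are assumed to be
4 bytes, and the function name and location strings are hypothetical. As the
note above warns, the description itself may contain errors.

```python
VGPR_ARG_COUNT = 32  # VGPR0 up to VGPR31
SGPR_ARG_COUNT = 30  # SGPR0 up to SGPR29
SLOT_BYTES = 4       # assumption: every argument is one 32-bit value


def assign_arguments(args):
    """Map each (name, inreg) argument to a register or stack location."""
    next_vgpr = 0
    next_sgpr = 0
    stack_offset = 0
    locations = {}
    for name, inreg in args:
        if inreg and next_sgpr < SGPR_ARG_COUNT:
            locations[name] = f"s{next_sgpr}"
            next_sgpr += 1
        elif not inreg and next_vgpr < VGPR_ARG_COUNT:
            locations[name] = f"v{next_vgpr}"
            next_vgpr += 1
        else:
            # Registers exhausted: allocate the next naturally aligned slot.
            locations[name] = f"stack+{stack_offset}"
            stack_offset += SLOT_BYTES
    return locations


if __name__ == "__main__":
    example = [(f"arg{i}", False) for i in range(33)] + [("uniform", True)]
    for name, location in assign_arguments(example).items():
        print(name, location)
```

With 33 VGPR arguments, the last one overflows to the first stack slot while
the ``inreg`` argument still receives SGPR0, which mirrors the point that a
decomposed struct can end up split between registers and memory.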
11028
11029.. TODO::
11030
11031  So, a struct which can pass some fields as decomposed register arguments, will
11032  pass the rest as decomposed stack elements? But an argument that will not start
11033  in registers will not be decomposed and will be passed as a non-decomposed
11034  stack value?
11035
11036The following is not part of the AMDGPU function calling convention but
11037describes how the AMDGPU implements function calls:
11038
1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
    unswizzled scratch address. It is only needed if a runtime sized ``alloca``
    is used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.
11045
3.  Allocating SGPR arguments on the stack is not supported.
11047
110484.  No CFI is currently generated. See
11049    :ref:`amdgpu-dwarf-call-frame-information`.
11050
11051    .. note::
11052
11053      CFI will be generated that defines the CFA as the unswizzled address
11054      relative to the wave scratch base in the unswizzled private address space
11055      of the lowest address stack allocated local variable.
11056
11057      ``DW_AT_frame_base`` will be defined as the swizzled address in the
11058      swizzled private address space by dividing the CFA by the wavefront size
11059      (since CFA is always at least dword aligned which matches the scratch
11060      swizzle element size).
11061
11062      If no dynamic stack alignment was performed, the stack allocated arguments
11063      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
11064      local variables and register spill slots are accessed as positive offsets
11065      relative to ``DW_AT_frame_base``.
11066
110675.  Function argument passing is implemented by copying the input physical
11068    registers to virtual registers on entry. The register allocator can spill if
11069    necessary. These are copied back to physical registers at call sites. The
11070    net effect is that each function call can have these values in entirely
11071    distinct locations. The IPRA can help avoid shuffling argument registers.
110726.  Call sites are implemented by setting up the arguments at positive offsets
11073    from SP. Then SP is incremented to account for the known frame size before
11074    the call and decremented after the call.
11075
11076    .. note::
11077
11078      The CFI will reflect the changed calculation needed to compute the CFA
11079      from SP.
11080
110817.  4 byte spill slots are used in the stack frame. One slot is allocated for an
11082    emergency spill slot. Buffer instructions are used for stack accesses and
11083    not the ``flat_scratch`` instruction.
11084
11085    .. TODO::
11086
11087      Explain when the emergency spill slot is used.
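The relationship between the CFA and ``DW_AT_frame_base`` described in the note
under item 4 above can be illustrated with a small worked calculation. This is
only a sketch of the stated arithmetic; the function name and the example
values are assumptions, and no CFI is currently generated.

```python
def frame_base_from_cfa(cfa_unswizzled, wavefront_size):
    """Convert an unswizzled CFA into the swizzled frame base address.

    The note states that the CFA is always at least dword aligned, so the
    sketch rejects values that are not multiples of 4 bytes.
    """
    if cfa_unswizzled % 4 != 0:
        raise ValueError("CFA is expected to be at least dword aligned")
    return cfa_unswizzled // wavefront_size


if __name__ == "__main__":
    # Hypothetical values: an unswizzled CFA of 4096 bytes in a 64-lane wave.
    print(frame_base_from_cfa(4096, 64))
```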
11088
11089.. TODO::
11090
11091  Possible broken issues:
11092
11093  - Stack arguments must be aligned to required alignment.
11094  - Stack is aligned to max(16, max formal argument alignment)
11095  - Direct argument < 64 bits should check register budget.
11096  - Register budget calculation should respect ``inreg`` for SGPR.
11097  - SGPR overflow is not handled.
11098  - struct with 1 member unpeeling is not checking size of member.
11099  - ``sret`` is after ``this`` pointer.
11100  - Caller is not implementing stack realignment: need an extra pointer.
11101  - Should say AMDGPU passes FP rather than SP.
  - Should CFI define CFA as the address of locals or arguments? The difference
    is apparent once dynamic alignment has been implemented.
11104  - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
11105    highest address of stack frame and use negative offset for locals. Would
11106    allow SP to be the same as FP and could support signal-handler-like as now
11107    have a real SP for the top of the stack.
11108  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
11109    arguments?
11110
11111AMDPAL
11112------
11113
11114This section provides code conventions used when the target triple OS is
11115``amdpal`` (see :ref:`amdgpu-target-triples`).
11116
11117.. _amdgpu-amdpal-code-object-metadata-section:
11118
11119Code Object Metadata
11120~~~~~~~~~~~~~~~~~~~~
11121
11122.. note::
11123
11124  The metadata is currently in development and is subject to major
11125  changes. Only the current version is supported. *When this document
11126  was generated the version was 2.6.*
11127
11128Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
11129record (see :ref:`amdgpu-note-records-v3-v4`).
11130
11131The metadata is represented as Message Pack formatted binary data (see
11132[MsgPack]_). The top level is a Message Pack map that includes the keys
11133defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
11134and referenced tables.
11135
11136Additional information can be added to the maps. To avoid conflicts, any
11137key names should be prefixed by "*vendor-name*." where ``vendor-name``
11138can be the name of the vendor and specific vendor tool that generates the
11139information. The prefix is abbreviated to simply "." when it appears
11140within a map that has been added by the same *vendor-name*.
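As a rough illustration of the key naming rule, the sketch below models the
top-level map as a plain Python dictionary before Message Pack encoding. The
helper ``abbreviate_key``, the vendor name ``acme``, the key
``acme.debug_info``, and all values are hypothetical placeholders; only the
``amdpal`` key names and the 2.6 version come from this section.

```python
def abbreviate_key(full_key, enclosing_vendor):
    """Shorten "vendor.name" to ".name" inside a map added by that vendor."""
    prefix = enclosing_vendor + "."
    if full_key.startswith(prefix):
        return "." + full_key[len(prefix):]
    return full_key


# The per-pipeline map is added by "amdpal", so its own keys use the short
# "." form, while a key from another (hypothetical) vendor keeps its prefix.
pipeline = {
    abbreviate_key("amdpal.name", "amdpal"): "example_pipeline",
    abbreviate_key("amdpal.type", "amdpal"): "Cs",
    abbreviate_key("amdpal.internal_pipeline_hash", "amdpal"): [0, 0],
    abbreviate_key("amdpal.registers", "amdpal"): {},
    abbreviate_key("acme.debug_info", "amdpal"): "placeholder",
}

metadata = {
    "amdpal.version": [2, 6],
    "amdpal.pipelines": [pipeline],
}

if __name__ == "__main__":
    for key in sorted(pipeline):
        print(key)
```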
11141
11142  .. table:: AMDPAL Code Object Metadata Map
11143     :name: amdgpu-amdpal-code-object-metadata-map-table
11144
11145     =================== ============== ========= ======================================================================
11146     String Key          Value Type     Required? Description
11147     =================== ============== ========= ======================================================================
11148     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
11149                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
11150     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
11151                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
11152                                                  definition of the keys included in that map.
11153     =================== ============== ========= ======================================================================
11154
11155..
11156
11157  .. table:: AMDPAL Code Object Pipeline Metadata Map
11158     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
11159
11160     ====================================== ============== ========= ===================================================
11161     String Key                             Value Type     Required? Description
11162     ====================================== ============== ========= ===================================================
11163     ".name"                                string                   Source name of the pipeline.
11164     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
11165
11166                                                                       - "VsPs"
11167                                                                       - "Gs"
11168                                                                       - "Cs"
11169                                                                       - "Ngg"
11170                                                                       - "Tess"
11171                                                                       - "GsTess"
11172                                                                       - "NggTess"
11173
11174     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
11175                                            2 integers               64 bits is the "stable" portion of the hash, used
11176                                                                     for e.g. shader replacement lookup. Upper 64 bits
11177                                                                     is the "unique" portion of the hash, used for
11178                                                                     e.g. pipeline cache lookup. The value is
11179                                                                     implementation defined, and can not be relied on
11180                                                                     between different builds of the compiler.
11181     ".shaders"                             map                      Per-API shader metadata. See
11182                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
11183                                                                     for the definition of the keys included in that
11184                                                                     map.
11185     ".hardware_stages"                     map                      Per-hardware stage metadata. See
11186                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
11187                                                                     for the definition of the keys included in that
11188                                                                     map.
11189     ".shader_functions"                    map                      Per-shader function metadata. See
11190                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
11191                                                                     for the definition of the keys included in that
11192                                                                     map.
11193     ".registers"                           map            Required  Hardware register configuration. See
11194                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
11195                                                                     for the definition of the keys included in that
11196                                                                     map.
11197     ".user_data_limit"                     integer                  Number of user data entries accessed by this
11198                                                                     pipeline.
11199     ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
11200                                                                     NoUserDataSpilling.
11201     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
11202                                                                     viewport array index feature. Pipelines which use
11203                                                                     this feature can render into all 16 viewports,
11204                                                                     whereas pipelines which do not use it are
11205                                                                     restricted to viewport #0.
11206     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
11207                                                                     handling data-passing between the ES and GS
11208                                                                     shader stages. This can be zero if the data is
11209                                                                     passed using off-chip buffers. This value should
11210                                                                     be used to program all user-SGPRs which have been
11211                                                                     marked with "UserDataMapping::EsGsLdsSize"
11212                                                                     (typically only the GS and VS HW stages will ever
11213                                                                     have a user-SGPR so marked).
11214     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
11215                                                                     (maximum number of threads in a subgroup).
11216     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
11217     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
11218     ".api"                                 string                   Name of the client graphics API.
11219     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
11220                                                                     be defined by the driver using the compiler if
11221                                                                     they want to be able to correlate API-specific
11222                                                                     information used during creation at a later time.
11223     ====================================== ============== ========= ===================================================
11224
11225..
11226
11227  .. table:: AMDPAL Code Object Shader Map
11228     :name: amdgpu-amdpal-code-object-shader-map-table
11229
11230
11231     +-------------+--------------+-------------------------------------------------------------------+
11232     |String Key   |Value Type    |Description                                                        |
11233     +=============+==============+===================================================================+
11234     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
11235     |- ".vertex"  |              |for the definition of the keys included in that map.               |
11236     |- ".hull"    |              |                                                                   |
11237     |- ".domain"  |              |                                                                   |
11238     |- ".geometry"|              |                                                                   |
11239     |- ".pixel"   |              |                                                                   |
11240     +-------------+--------------+-------------------------------------------------------------------+
11241
11242..
11243
11244  .. table:: AMDPAL Code Object API Shader Metadata Map
11245     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
11246
11247     ==================== ============== ========= =====================================================================
11248     String Key           Value Type     Required? Description
11249     ==================== ============== ========= =====================================================================
11250     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
11251                          2 integers               is implementation defined, and can not be relied on between
11252                                                   different builds of the compiler.
11253     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
11254                          string                   include:
11255
11256                                                     - ".ls"
11257                                                     - ".hs"
11258                                                     - ".es"
11259                                                     - ".gs"
11260                                                     - ".vs"
11261                                                     - ".ps"
11262                                                     - ".cs"
11263
11264     ==================== ============== ========= =====================================================================
11265
11266..
11267
11268  .. table:: AMDPAL Code Object Hardware Stage Map
11269     :name: amdgpu-amdpal-code-object-hardware-stage-map-table
11270
11271     +-------------+--------------+-----------------------------------------------------------------------+
11272     |String Key   |Value Type    |Description                                                            |
11273     +=============+==============+=======================================================================+
11274     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
11275     |- ".hs"      |              |for the definition of the keys included in that map.                   |
11276     |- ".es"      |              |                                                                       |
11277     |- ".gs"      |              |                                                                       |
11278     |- ".vs"      |              |                                                                       |
11279     |- ".ps"      |              |                                                                       |
11280     |- ".cs"      |              |                                                                       |
11281     +-------------+--------------+-----------------------------------------------------------------------+
11282
11283..
11284
11285  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
11286     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
11287
11288     ========================== ============== ========= ===============================================================
11289     String Key                 Value Type     Required? Description
11290     ========================== ============== ========= ===============================================================
11291     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
11292     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
11293     ".lds_size"                integer                  Local Data Share size in bytes.
11294     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
11295     ".vgpr_count"              integer                  Number of VGPRs used.
11296     ".sgpr_count"              integer                  Number of SGPRs used.
11297     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
11298                                                         directive to instruct the compiler to limit the VGPR usage to
11299                                                         be less than or equal to the specified value (only set if
11300                                                         different from HW default).
11301     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
11302                                                         default).
11303     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
11304                                3 integers
11305     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
11306     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
11307     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
11308     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
11309     ".writes_depth"            boolean                  The shader writes out a depth value.
11310     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
11311                                                         memory or GDS.
11312     ".uses_prim_id"            boolean                  The shader uses PrimID.
11313     ========================== ============== ========= ===============================================================
11314
11315..
11316
11317  .. table:: AMDPAL Code Object Shader Function Map
11318     :name: amdgpu-amdpal-code-object-shader-function-map-table
11319
11320     =============== ============== ====================================================================
11321     String Key      Value Type     Description
11322     =============== ============== ====================================================================
11323     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
11324                                    entry address. The value is the function's metadata. See
11325                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
11326     =============== ============== ====================================================================
11327
11328..
11329
11330  .. table:: AMDPAL Code Object Shader Function Metadata Map
11331     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
11332
11333     ============================= ============== =================================================================
11334     String Key                    Value Type     Description
11335     ============================= ============== =================================================================
11336     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
11337                                   2 integers     is implementation defined, and can not be relied on between
11338                                                  different builds of the compiler.
11339     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
11340     ".lds_size"                   integer        Size in bytes of LDS memory.
11341     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
11342     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
11343     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
11344     ".shader_subtype"             string         Shader subtype/kind. Values include:
11345
11346                                                    - "Unknown"
11347
11348     ============================= ============== =================================================================
11349
11350..
11351
11352  .. table:: AMDPAL Code Object Register Map
11353     :name: amdgpu-amdpal-code-object-register-map-table
11354
11355     ========================== ============== ====================================================================
11356     32-bit Integer Key         Value Type     Description
11357     ========================== ============== ====================================================================
11358     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
11359                                               a GRBM register (i.e., driver accessible GPU register number, not
11360                                               shader GPR register number). The driver is required to program each
11361                                               specified register to the corresponding specified value when
11362                                               executing this pipeline. Typically, the ``reg offsets`` are the
11363                                               ``uint16_t`` offsets to each register as defined by the hardware
11364                                               chip headers. The register is set to the provided value. However, a
11365                                               ``reg offset`` that specifies a user data register (e.g.,
11366                                               COMPUTE_USER_DATA_0) needs special treatment. See
11367                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
11368                                               information.
11369     ========================== ============== ====================================================================
11370
11371.. _amdgpu-amdpal-code-object-user-data-section:
11372
11373User Data
11374+++++++++
11375
11376Each hardware stage has a set of 32-bit physical SPI *user data registers*
11377(either 16 or 32 based on graphics IP and the stage) which can be
11378written from a command buffer and then loaded into SGPRs when waves are
11379launched via a subsequent dispatch or draw operation. This is the way
11380most arguments are passed from the application/runtime to a hardware
11381shader.
11382
11383PAL abstracts this functionality by exposing a set of 128 *user data
11384entries* per pipeline a client can use to pass arguments from a command
11385buffer to one or more shaders in that pipeline. The ELF code object must
11386specify a mapping from virtualized *user data entries* to physical *user
11387data registers*, and PAL is responsible for implementing that mapping,
11388including spilling overflow *user data entries* to memory if needed.
11389
11390Since the *user data registers* are GRBM-accessible SPI registers, this
11391mapping is actually embedded in the ``.registers`` metadata entry. For
11392most registers, the value in that map is a literal 32-bit value that
11393should be written to the register by the driver. However, when the
11394register is a *user data register* (any USER_DATA register e.g.,
11395SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
11396the driver to write either a *user data entry* value or one of several
11397driver-internal values to the register. This encoding is described in
11398the following table:
11399
11400.. note::
11401
  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0
  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
11404  always be programmed to the address of the GlobalTable, and *user data
11405  register* 1 must always be programmed to the address of the PerShaderTable.
11406
11407..
11408
11409  .. table:: AMDPAL User Data Mapping
11410     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
11411
11412     ==========  =================  ===============================================================================
11413     Value       Name               Description
11414     ==========  =================  ===============================================================================
11415     0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
11416     0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
11417                                    always point to *user data register* 0).
11418     0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
11419                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
11420                                    for more detail (should always point to *user data register* 1).
11421     0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
11422                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
11423                                    more detail.
11424     0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
11425                                    reference the draw index in the vertex shader. Only supported by the first
11426                                    stage in a graphics pipeline.
11427     0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
11428                                    a graphics pipeline.
11429     0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
11430                                    graphics pipeline.
11431     0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
11432                                    a buffer containing the grid dimensions for a Compute dispatch operation. The
11433                                    high half of the address is stored in the next sequential user-SGPR. Only
11434                                    supported by compute pipelines.
11435     0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
11436                                    space used for the ES/GS pseudo-ring-buffer for passing data between shader
11437                                    stages.
     0x1000000B  ViewId             View ID (32-bit unsigned integer). Identifies a view of graphics pipeline
                                    instancing.
11440     0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
11441                                    can only appear for one shader stage per pipeline.
11442     0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
11443     0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
11444                                    only appear for one shader stage per pipeline.
11445     0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
11446                                    only appear for one shader stage per pipeline (PS). These replace color targets
11447                                    and are completely separate from any UAVs used by the shader. This is optional,
11448                                    and only used by the PS when UAV exports are used to replace color-target
11449                                    exports to optimize specific shaders.
11450     0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
11451                                    some NGG pipelines to perform culling.  This value contains the address of the
11452                                    first of two consecutive registers which provide the full GPU address.
11453     0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
11454     ==========  =================  ===============================================================================
11455
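As a rough illustration of the encoding in
:ref:`amdgpu-amdpal-code-object-metadata-user-data-mapping-table`, the
following Python sketch classifies the value stored for a *user data register*
in the ``.registers`` map. The function ``decode_user_data_value`` and the
``INTERNAL_VALUES`` dictionary are hypothetical helpers written for this
example and are not part of PAL or LLVM. Only the numeric values and names are
taken from the table.

.. code-block:: python

  # Driver-internal values copied from the AMDPAL User Data Mapping table.
  INTERNAL_VALUES = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }

  def decode_user_data_value(value):
      """Return ("entry", N) or ("internal", name) for a mapping value."""
      if 0 <= value <= 127:
          # Values 0..127 select user_data_entry[N].
          return ("entry", value)
      if value in INTERNAL_VALUES:
          return ("internal", INTERNAL_VALUES[value])
      raise ValueError(f"unrecognized user data mapping value: {value:#x}")

  print(decode_user_data_value(5))           # ('entry', 5)
  print(decode_user_data_value(0x10000002))  # ('internal', 'SpillTable')
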
11456.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
11457
11458Per-Shader Table
11459################
11460
11461Low 32 bits of the GPU address for an optional buffer in the ``.data``
11462section of the ELF. The high 32 bits of the address match the high 32 bits
11463of the shader's program counter.
11464
11465The buffer can be anything the shader compiler needs it for, and
11466allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
11469well. Its layout and usage are defined by the shader compiler.
11470
11471Each shader's table in the ``.data`` section is referenced by the symbol
11472``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``  where *xs* corresponds with the
11473hardware shader stage the data is for. E.g.,
11474``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
11475
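The address reconstruction described at the start of this section can be
sketched in Python as follows. The function name and the sample addresses are
made up for illustration. The sketch assumes the shader has its own 64-bit
program counter available as an integer.

.. code-block:: python

  def per_shader_table_address(table_low32, shader_pc):
      """Combine the low 32 bits held in the user data register with the
      high 32 bits of the shader's 64-bit program counter."""
      high32 = shader_pc & 0xFFFFFFFF00000000
      return high32 | (table_low32 & 0xFFFFFFFF)

  # Hypothetical values: the table lives in the same 4 GiB window as the code.
  addr = per_shader_table_address(0x00402000, 0x0000800000401000)
  print(hex(addr))  # 0x800000402000
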
11476.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
11477
11478Spill Table
11479###########
11480
11481It is possible for a hardware shader to need access to more *user data
11482entries* than there are slots available in user data registers for one
11483or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory, with one
user data register pointing to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address. The *spill table* itself represents a set of 32-bit values
11489managed by the PAL runtime in GPU-accessible memory that can be made
11490indirectly accessible to a hardware shader.
11491
11492Unspecified OS
11493--------------
11494
11495This section provides code conventions used when the target triple OS is
11496empty (see :ref:`amdgpu-target-triples`).
11497
11498Trap Handler ABI
11499~~~~~~~~~~~~~~~~
11500
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
11504
11505  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
11506     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
11507
11508     =============== =============== ===========================================
11509     Usage           Code Sequence   Description
11510     =============== =============== ===========================================
11511     llvm.trap       s_endpgm        Causes wavefront to be terminated.
11512     llvm.debugtrap  *none*          Compiler warning given that there is no
11513                                     trap handler installed.
11514     =============== =============== ===========================================
11515
11516Source Languages
11517================
11518
11519.. _amdgpu-opencl:
11520
11521OpenCL
11522------
11523
11524When the language is OpenCL the following differences occur:
11525
115261. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
115272. The AMDGPU backend appends additional arguments to the kernel's explicit
11528   arguments for the AMDHSA OS (see
11529   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
115303. Additional metadata is generated
11531   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
11532
11533  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
11534     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
11535
11536     ======== ==== ========= ===========================================
11537     Position Byte Byte      Description
11538              Size Alignment
11539     ======== ==== ========= ===========================================
11540     1        8    8         OpenCL Global Offset X
11541     2        8    8         OpenCL Global Offset Y
11542     3        8    8         OpenCL Global Offset Z
11543     4        8    8         OpenCL address of printf buffer
11544     5        8    8         OpenCL address of virtual queue used by
11545                             enqueue_kernel.
11546     6        8    8         OpenCL address of AqlWrap struct used by
11547                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
11549                             synchronization.
11550     ======== ==== ========= ===========================================
11551
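To make the layout in
:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table` concrete,
the Python sketch below computes byte offsets for the seven implicit arguments.
It assumes that the implicit arguments start at the first 8-byte-aligned offset
after the explicit arguments. The argument labels and function names are
hypothetical and chosen only for this example.

.. code-block:: python

  # Labels are illustrative; each argument is 8 bytes with 8-byte alignment.
  IMPLICIT_ARGS = [
      "global_offset_x",
      "global_offset_y",
      "global_offset_z",
      "printf_buffer",
      "virtual_queue",
      "aql_wrap",
      "multigrid_sync_arg",
  ]

  def align_up(value, alignment):
      return (value + alignment - 1) // alignment * alignment

  def implicit_arg_offsets(explicit_args_size):
      """Map each implicit argument label to its byte offset (assumed layout)."""
      offset = align_up(explicit_args_size, 8)
      layout = {}
      for name in IMPLICIT_ARGS:
          layout[name] = offset
          offset += 8
      return layout

  layout = implicit_arg_offsets(20)
  print(layout["global_offset_x"])     # 24
  print(layout["multigrid_sync_arg"])  # 72
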
11552.. _amdgpu-hcc:
11553
11554HCC
11555---
11556
11557When the language is HCC the following differences occur:
11558
115591. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11560
11561.. _amdgpu-assembler:
11562
11563Assembler
11564---------
11565
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX10.
11568
11569This section describes general syntax for instructions and operands.
11570
11571Instructions
11572~~~~~~~~~~~~
11573
11574An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
11575
11576  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
11577    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
11578
11579:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
11580:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
11581
11582The order of operands and modifiers is fixed.
11583Most modifiers are optional and may be omitted.
11584
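The separation rules above (comma-separated operands followed by
space-separated modifiers) can be illustrated with a small Python sketch. This
is not how the LLVM-MC assembler parses instructions. It is a simplified model
with hypothetical function names that ignores comments and operands containing
spaces, such as ``s_waitcnt`` counter expressions.

.. code-block:: python

  def split_top_level(text, sep=","):
      """Split on separators that are not nested inside [] or ()."""
      parts, depth, current = [], 0, []
      for ch in text:
          if ch in "[(":
              depth += 1
          elif ch in "])":
              depth -= 1
          if ch == sep and depth == 0:
              parts.append("".join(current).strip())
              current = []
          else:
              current.append(ch)
      parts.append("".join(current).strip())
      return parts

  def parse_instruction(line):
      """Return (opcode, operands, modifiers) for a simple instruction line."""
      opcode, _, rest = line.strip().partition(" ")
      if not rest.strip():
          return opcode, [], []
      chunks = split_top_level(rest)
      # The last comma-separated chunk holds the final operand followed by
      # any space-separated modifiers.
      tokens = chunks[-1].split()
      return opcode, chunks[:-1] + tokens[:1], tokens[1:]

  print(parse_instruction("ds_add_u32 v2, v4 offset:16"))
  # ('ds_add_u32', ['v2', 'v4'], ['offset:16'])
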
Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included
in these descriptions.
11588
11589    =================================== =======================================
11590    Core ISA                            ISA Extensions
11591    =================================== =======================================
11592    :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
11593    :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
11594    :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
11595
11596                                        :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
11597
11598                                        :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
11599
11600                                        :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
11601
11602                                        :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
11603
11604                                        :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
11605
11606                                        :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
11607
11608    :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
11609
11610                                        :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
11611    =================================== =======================================
11612
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.
11618
11619Operands
11620~~~~~~~~
11621
A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.
11623
11624Modifiers
11625~~~~~~~~~
11626
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
11629
11630Instruction Examples
11631~~~~~~~~~~~~~~~~~~~~
11632
11633DS
11634++
11635
11636.. code-block:: nasm
11637
11638  ds_add_u32 v2, v4 offset:16
11639  ds_write_src2_b64 v2 offset0:4 offset1:8
11640  ds_cmpst_f32 v2, v4, v6
11641  ds_min_rtn_f64 v[8:9], v2, v[4:5]
11642
For the full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.
11645
11646FLAT
11647++++
11648
11649.. code-block:: nasm
11650
11651  flat_load_dword v1, v[3:4]
11652  flat_store_dwordx3 v[3:4], v[5:7]
11653  flat_atomic_swap v1, v[3:4], v5 glc
11654  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
11655  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
11656
For the full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.
11659
11660MUBUF
11661+++++
11662
11663.. code-block:: nasm
11664
11665  buffer_load_dword v1, off, s[4:7], s1
11666  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
11667  buffer_store_format_xy v[1:2], off, s[4:7], s1
11668  buffer_wbinvl1
11669  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
11670
For the full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.
11673
11674SMRD/SMEM
11675+++++++++
11676
11677.. code-block:: nasm
11678
11679  s_load_dword s1, s[2:3], 0xfc
11680  s_load_dwordx8 s[8:15], s[2:3], s4
11681  s_load_dwordx16 s[88:103], s[2:3], s4
11682  s_dcache_inv_vol
11683  s_memtime s[4:5]
11684
For the full list of supported instructions, refer to "Scalar Memory
Operations" in the ISA Manual.
11687
11688SOP1
11689++++
11690
11691.. code-block:: nasm
11692
11693  s_mov_b32 s1, s2
11694  s_mov_b64 s[0:1], 0x80000000
11695  s_cmov_b32 s1, 200
11696  s_wqm_b64 s[2:3], s[4:5]
11697  s_bcnt0_i32_b64 s1, s[2:3]
11698  s_swappc_b64 s[2:3], s[4:5]
11699  s_cbranch_join s[4:5]
11700
For the full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.
11703
11704SOP2
11705++++
11706
11707.. code-block:: nasm
11708
11709  s_add_u32 s1, s2, s3
11710  s_and_b64 s[2:3], s[4:5], s[6:7]
11711  s_cselect_b32 s1, s2, s3
11712  s_andn2_b32 s2, s4, s6
11713  s_lshr_b64 s[2:3], s[4:5], s6
11714  s_ashr_i32 s2, s4, s6
11715  s_bfm_b64 s[2:3], s4, s6
11716  s_bfe_i64 s[2:3], s[4:5], s6
11717  s_cbranch_g_fork s[4:5], s[6:7]
11718
For the full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.
11721
11722SOPC
11723++++
11724
11725.. code-block:: nasm
11726
11727  s_cmp_eq_i32 s1, s2
11728  s_bitcmp1_b32 s1, s2
11729  s_bitcmp0_b64 s[2:3], s4
11730  s_setvskip s3, s5
11731
For the full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.
11734
11735SOPP
11736++++
11737
11738.. code-block:: nasm
11739
11740  s_barrier
11741  s_nop 2
11742  s_endpgm
11743  s_waitcnt 0 ; Wait for all counters to be 0
11744  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
11745  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
11746  s_sethalt 9
11747  s_sleep 10
11748  s_sendmsg 0x1
11749  s_sendmsg sendmsg(MSG_INTERRUPT)
11750  s_trap 1
11751
For the full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.
11754
Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
11758
11759VALU
11760++++
11761
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the
instruction's operands. To force a specific encoding, one can add a suffix to
the opcode of the instruction:

* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA
11770
11771VOP1/VOP2/VOP3/VOPC examples:
11772
11773.. code-block:: nasm
11774
11775  v_mov_b32 v1, v2
11776  v_mov_b32_e32 v1, v2
11777  v_nop
11778  v_cvt_f64_i32_e32 v[1:2], v2
11779  v_floor_f32_e32 v1, v2
11780  v_bfrev_b32_e32 v1, v2
11781  v_add_f32_e32 v1, v2, v3
11782  v_mul_i32_i24_e64 v1, v2, 3
11783  v_mul_i32_i24_e32 v1, -3, v3
11784  v_mul_i32_i24_e32 v1, -100, v3
11785  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
11786  v_max_f16_e32 v1, v2, v3
11787
11788VOP_DPP examples:
11789
11790.. code-block:: nasm
11791
11792  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
11793  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11794  v_mov_b32 v0, v0 wave_shl:1
11795  v_mov_b32 v0, v0 row_mirror
11796  v_mov_b32 v0, v0 row_bcast:31
11797  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
11798  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11799  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11800
11801VOP_SDWA examples:
11802
11803.. code-block:: nasm
11804
11805  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
11806  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
11807  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
11808  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
11809  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
11810
For the full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.
11812
11813.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
11814
11815Code Object V2 Predefined Symbols
11816~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11817
11818.. warning::
11819  Code object V2 is not the default code object version emitted by
11820  this version of LLVM.
11821
11822The AMDGPU assembler defines and updates some symbols automatically. These
11823symbols do not affect code generation.
11824
11825.option.machine_version_major
11826+++++++++++++++++++++++++++++
11827
11828Set to the GFX major generation number of the target being assembled for. For
11829example, when assembling for a "GFX9" target this will be set to the integer
11830value "9". The possible GFX major generation numbers are presented in
11831:ref:`amdgpu-processors`.
11832
11833.option.machine_version_minor
11834+++++++++++++++++++++++++++++
11835
11836Set to the GFX minor generation number of the target being assembled for. For
11837example, when assembling for a "GFX810" target this will be set to the integer
11838value "1". The possible GFX minor generation numbers are presented in
11839:ref:`amdgpu-processors`.
11840
11841.option.machine_version_stepping
11842++++++++++++++++++++++++++++++++
11843
11844Set to the GFX stepping generation number of the target being assembled for.
11845For example, when assembling for a "GFX704" target this will be set to the
11846integer value "4". The possible GFX stepping generation numbers are presented
11847in :ref:`amdgpu-processors`.
11848
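The relationship between a processor name and the three
``.option.machine_version_*`` symbols can be sketched in Python. The sketch
assumes the processor name has the form ``gfx<major><minor><stepping>``, where
the last two characters are single hexadecimal digits (for example, ``gfx90a``
has stepping 10). The function name is hypothetical, and
:ref:`amdgpu-processors` remains the authoritative list.

.. code-block:: python

  def split_gfx_name(name):
      """Split a processor name such as "gfx810" into (major, minor, stepping)."""
      name = name.lower()
      if not name.startswith("gfx") or len(name) < 6:
          raise ValueError(f"unexpected processor name: {name!r}")
      digits = name[3:]
      major = int(digits[:-2])         # one or two decimal digits
      minor = int(digits[-2], 16)      # single digit
      stepping = int(digits[-1], 16)   # single digit, may be a-f
      return major, minor, stepping

  print(split_gfx_name("GFX810"))  # (8, 1, 0)
  print(split_gfx_name("GFX704"))  # (7, 0, 4)
  print(split_gfx_name("gfx90a"))  # (9, 0, 10)
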
11849.kernel.vgpr_count
11850++++++++++++++++++
11851
11852Set to zero each time a
11853:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11854encountered. At each instruction, if the current value of this symbol is less
11855than or equal to the maximum VGPR number explicitly referenced within that
11856instruction then the symbol value is updated to equal that VGPR number plus
11857one.
11858
11859.kernel.sgpr_count
11860++++++++++++++++++
11861
11862Set to zero each time a
11863:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11864encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
11866instruction then the symbol value is updated to equal that SGPR number plus
11867one.
11868
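The update rule shared by ``.kernel.vgpr_count`` and ``.kernel.sgpr_count`` can
be modelled with a short Python sketch. The class and method names are
hypothetical. The caller is assumed to supply the register numbers that each
instruction explicitly references, since operand parsing is out of scope here.

.. code-block:: python

  class KernelRegisterCounter:
      """Toy model of the .kernel.vgpr_count / .kernel.sgpr_count symbols."""

      def __init__(self):
          self.vgpr_count = 0
          self.sgpr_count = 0

      def start_kernel(self):
          # Models encountering an .amdgpu_hsa_kernel directive.
          self.vgpr_count = 0
          self.sgpr_count = 0

      def note_instruction(self, vgprs=(), sgprs=()):
          # Raise each symbol to (highest referenced register number + 1).
          if vgprs and self.vgpr_count <= max(vgprs):
              self.vgpr_count = max(vgprs) + 1
          if sgprs and self.sgpr_count <= max(sgprs):
              self.sgpr_count = max(sgprs) + 1

  counter = KernelRegisterCounter()
  counter.start_kernel()
  counter.note_instruction(sgprs=[0, 1])     # s_load_dwordx2 s[0:1], s[0:1], 0x0
  counter.note_instruction(vgprs=[1], sgprs=[0])  # v_mov_b32 v1, s0
  counter.note_instruction(vgprs=[0, 1, 2])  # flat_store_dword v[1:2], v0
  print(counter.vgpr_count, counter.sgpr_count)  # 3 2
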
11869.. _amdgpu-amdhsa-assembler-directives-v2:
11870
11871Code Object V2 Directives
11872~~~~~~~~~~~~~~~~~~~~~~~~~
11873
11874.. warning::
11875  Code object V2 is not the default code object version emitted by
11876  this version of LLVM.
11877
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
11880
11881.hsa_code_object_version major, minor
11882+++++++++++++++++++++++++++++++++++++
11883
11884*major* and *minor* are integers that specify the version of the HSA code
11885object that will be generated by the assembler.
11886
11887.hsa_code_object_isa [major, minor, stepping, vendor, arch]
11888+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
11889
11890
11891*major*, *minor*, and *stepping* are all integers that describe the instruction
11892set architecture (ISA) version of the assembly program.
11893
11894*vendor* and *arch* are quoted strings. *vendor* should always be equal to
11895"AMD" and *arch* should always be equal to "AMDGPU".
11896
11897By default, the assembler will derive the ISA version, *vendor*, and *arch*
11898from the value of the -mcpu option that is passed to the assembler.
11899
11900.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
11901
11902.amdgpu_hsa_kernel (name)
11903+++++++++++++++++++++++++
11904
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
11908
11909.amd_kernel_code_t
11910++++++++++++++++++
11911
11912This directive marks the beginning of a list of key / value pairs that are used
11913to specify the amd_kernel_code_t object that will be emitted by the assembler.
11914The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
11915amd_kernel_code_t values that are unspecified a default value will be used. The
11916default value for all keys is 0, with the following exceptions:
11917
11918- *amd_code_version_major* defaults to 1.
11919- *amd_kernel_code_version_minor* defaults to 2.
11920- *amd_machine_kind* defaults to 1.
11921- *amd_machine_version_major*, *machine_version_minor*, and
11922  *amd_machine_version_stepping* are derived from the value of the -mcpu option
11923  that is passed to the assembler.
11924- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards, it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5.
11927  Note that wavefront size is specified as a power of two, so a value of **n**
11928  means a size of 2^ **n**.
11929- *call_convention* defaults to -1.
11930- *kernarg_segment_alignment*, *group_segment_alignment*, and
11931  *private_segment_alignment* default to 4. Note that alignments are specified
11932  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
11933- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
11934  GFX90A onwards.
11935- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
11936  GFX10 onwards.
11937- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
11938
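Because *wavefront_size* and the segment alignment keys are stored as powers of
two, converting between the encoded field and the actual size is a shift. The
Python helpers below are hypothetical and only illustrate that arithmetic.

.. code-block:: python

  def log2_field_to_size(n):
      """Decode a power-of-two field: a value of n means 2**n."""
      return 1 << n

  def size_to_log2_field(size):
      """Encode a power-of-two size as its base-2 logarithm."""
      if size <= 0 or size & (size - 1):
          raise ValueError("size must be a positive power of two")
      return size.bit_length() - 1

  print(log2_field_to_size(6))   # 64, wavefront_size = 6
  print(log2_field_to_size(5))   # 32, wavefront_size = 5
  print(log2_field_to_size(4))   # 16, default segment alignment in bytes
  print(size_to_log2_field(64))  # 6
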
11939The *.amd_kernel_code_t* directive must be placed immediately after the
11940function label and before any instructions.
11941
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.
11944
11945.. _amdgpu-amdhsa-assembler-example-v2:
11946
11947Code Object V2 Example Source Code
11948~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11949
11950.. warning::
11951  Code Object V2 is not the default code object version emitted by
11952  this version of LLVM.
11953
11954Here is an example of a minimal assembly source file, defining one HSA kernel:
11955
11956.. code::
11957   :number-lines:
11958
11959   .hsa_code_object_version 1,0
11960   .hsa_code_object_isa
11961
11962   .hsatext
11963   .globl  hello_world
11964   .p2align 8
11965   .amdgpu_hsa_kernel hello_world
11966
11967   hello_world:
11968
11969      .amd_kernel_code_t
11970         enable_sgpr_kernarg_segment_ptr = 1
11971         is_ptr64 = 1
11972         compute_pgm_rsrc1_vgprs = 0
11973         compute_pgm_rsrc1_sgprs = 0
11974         compute_pgm_rsrc2_user_sgpr = 2
11975         compute_pgm_rsrc1_wgp_mode = 0
11976         compute_pgm_rsrc1_mem_ordered = 0
11977         compute_pgm_rsrc1_fwd_progress = 1
11978     .end_amd_kernel_code_t
11979
     s_load_dwordx2 s[0:1], s[0:1], 0x0
11981     v_mov_b32 v0, 3.14159
11982     s_waitcnt lgkmcnt(0)
11983     v_mov_b32 v1, s0
11984     v_mov_b32 v2, s1
11985     flat_store_dword v[1:2], v0
11986     s_endpgm
11987   .Lfunc_end0:
11988        .size   hello_world, .Lfunc_end0-hello_world
11989
11990.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:
11991
11992Code Object V3 to V4 Predefined Symbols
11993~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11994
11995The AMDGPU assembler defines and updates some symbols automatically. These
11996symbols do not affect code generation.
11997
11998.amdgcn.gfx_generation_number
11999+++++++++++++++++++++++++++++
12000
12001Set to the GFX major generation number of the target being assembled for. For
12002example, when assembling for a "GFX9" target this will be set to the integer
12003value "9". The possible GFX major generation numbers are presented in
12004:ref:`amdgpu-processors`.
12005
12006.amdgcn.gfx_generation_minor
12007++++++++++++++++++++++++++++
12008
12009Set to the GFX minor generation number of the target being assembled for. For
12010example, when assembling for a "GFX810" target this will be set to the integer
12011value "1". The possible GFX minor generation numbers are presented in
12012:ref:`amdgpu-processors`.
12013
12014.amdgcn.gfx_generation_stepping
12015+++++++++++++++++++++++++++++++
12016
12017Set to the GFX stepping generation number of the target being assembled for.
12018For example, when assembling for a "GFX704" target this will be set to the
12019integer value "4". The possible GFX stepping generation numbers are presented
12020in :ref:`amdgpu-processors`.
12021
12022.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
12023
12024.amdgcn.next_free_vgpr
12025++++++++++++++++++++++
12026
12027Set to zero before assembly begins. At each instruction, if the current value
12028of this symbol is less than or equal to the maximum VGPR number explicitly
12029referenced within that instruction then the symbol value is updated to equal
12030that VGPR number plus one.
12031
May be used to set the ``.amdhsa_next_free_vgpr`` directive in
12033:ref:`amdhsa-kernel-directives-table`.
12034
12035May be set at any time, e.g. manually set to zero at the start of each kernel.
12036
12037.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
12038
12039.amdgcn.next_free_sgpr
12040++++++++++++++++++++++
12041
12042Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
12044referenced within that instruction then the symbol value is updated to equal
12045that SGPR number plus one.
12046
May be used to set the ``.amdhsa_next_free_sgpr`` directive in
12048:ref:`amdhsa-kernel-directives-table`.
12049
12050May be set at any time, e.g. manually set to zero at the start of each kernel.
12051
12052.. _amdgpu-amdhsa-assembler-directives-v3-v4:
12053
12054Code Object V3 to V4 Directives
12055~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12056
12057Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
12058architecture processors, and are not OS-specific. Directives which begin with
12059``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
12060``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
12061:ref:`amdgpu-processors`.
12062
12063.. _amdgpu-assembler-directive-amdgcn-target:
12064
12065.amdgcn_target <target-triple> "-" <target-id>
12066++++++++++++++++++++++++++++++++++++++++++++++
12067
12068Optional directive which declares the ``<target-triple>-<target-id>`` supported
12069by the containing assembler source file. Used by the assembler to validate
12070command-line options such as ``-triple``, ``-mcpu``, and
12071``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
12072:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
12073
12074.. note::
12075
12076  The target ID syntax used for code object V2 to V3 for this directive differs
12077  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
12078
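The ``<target-triple>-<target-id>`` string is the four triple components joined
by hyphens, followed by another hyphen and the target ID. When the environment
component is empty, this yields two consecutive hyphens. The Python sketch
below only demonstrates the string construction. The function name is
hypothetical, and ``gfx900`` is used purely as an example target ID.

.. code-block:: python

  def amdgcn_target_string(arch, vendor, os, environment, target_id):
      """Join <Architecture>-<Vendor>-<OS>-<Environment> with the target ID."""
      triple = "-".join([arch, vendor, os, environment])
      return f"{triple}-{target_id}"

  # An empty environment leaves a double hyphen before the target ID.
  print(amdgcn_target_string("amdgcn", "amd", "amdhsa", "", "gfx900"))
  # amdgcn-amd-amdhsa--gfx900
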
12079.amdhsa_kernel <name>
12080+++++++++++++++++++++
12081
12082Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
12083``<name>.kd``, in the current location of the current section. Only valid when
12084the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
12085instruction to execute, and does not need to be previously defined.
12086
12087Marks the beginning of a list of directives used to generate the bytes of a
12088kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
12089Directives which may appear in this list are described in
12090:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
12091be valid for the target being assembled for, and cannot be repeated. Directives
12092support the range of values specified by the field they reference in
12093:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
12094assumed to have its default value, unless it is marked as "Required", in which
12095case it is an error to omit the directive. This list of directives is
12096terminated by an ``.end_amdhsa_kernel`` directive.
12097
12098  .. table:: AMDHSA Kernel Assembler Directives
12099     :name: amdhsa-kernel-directives-table
12100
12101     ======================================================== =================== ============ ===================
12102     Directive                                                Default             Supported On Description
12103     ======================================================== =================== ============ ===================
12104     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
12105                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12106     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
12107                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12108     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX10   Controls KERNARG_SIZE in
12109                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12110     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
12111                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12112     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
12113                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12114     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
12115                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12116     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
12117                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12118     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
12119                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12120     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
12121                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12122     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
12123                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12124     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
12125                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12126                                                              Specific
12127                                                              (wavefrontsize64)
12128     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
12129                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12130     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
12131                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12132     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
12133                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12134     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
12135                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12136     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
12137                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12138     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
12139                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12140                                                                                               Possible values are defined in
12141                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
12142     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
12143                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
12144                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12145     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
12146                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12147                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12148     ``.amdhsa_accum_offset``                                 Required            GFX90A       Offset of the first AccVGPR in the unified register file.
12149                                                                                               Used to calculate ACCUM_OFFSET in
12150                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12151     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
12152                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12153                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12154     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
12155                                                                                               scratch memory. Used to calculate
12156                                                                                               GRANULATED_WAVEFRONT_SGPR_COUNT in
12157                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12158     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
12159                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12160                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12161                                                              (xnack)
12162     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
12163                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12164                                                                                               Possible values are defined in
12165                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12166     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
12167                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12168                                                                                               Possible values are defined in
12169                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12170     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
12171                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12172                                                                                               Possible values are defined in
12173                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12174     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
12175                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12176                                                                                               Possible values are defined in
12177                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12178     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
12179                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12180     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
12181                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12182     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
12183                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12184     ``.amdhsa_tg_split``                                     Target              GFX90A       Controls TG_SPLIT in
12185                                                              Feature                          :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12186                                                              Specific
12187                                                              (tgsplit)
12188     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls WGP_MODE in
12189                                                              Feature                          :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12190                                                              Specific
12191                                                              (cumode)
12192     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
12193                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12194     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
12195                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12196     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
12197                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12198     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
12199                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12200     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
12201                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12202     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
12203                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12204     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
12205                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12206     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
12207                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12208     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
12209                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12210     ======================================================== =================== ============ ===================
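
As an illustration of how the directives in the table combine, the following
sketch overrides a few defaults for a hypothetical kernel named ``my_kernel``.
The kernel name and every value are assumptions chosen for illustration only.
``my_kernel`` is assumed to be defined in ``.text`` elsewhere in the same source
file, and any directive that is omitted keeps the default listed in the table:

.. code::

   .rodata
   .p2align 6
   .amdhsa_kernel my_kernel                   // hypothetical kernel name
     .amdhsa_group_segment_fixed_size 1024    // overrides the default of 0
     .amdhsa_kernarg_size 16                  // overrides the default of 0
     .amdhsa_user_sgpr_kernarg_segment_ptr 1  // overrides the default of 0
     .amdhsa_system_sgpr_workgroup_id_y 1     // overrides the default of 0
     .amdhsa_next_free_vgpr 8                 // required, illustrative value
     .amdhsa_next_free_sgpr 16                // required, illustrative value
   .end_amdhsa_kernel

Only ``.amdhsa_next_free_vgpr`` and ``.amdhsa_next_free_sgpr`` are marked
Required for this target in the table; the remaining lines are present solely
to show the syntax for overriding a default.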
12211
12212.amdgpu_metadata
12213++++++++++++++++
12214
12215Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
12216note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).
12217
12218The contents must be in the [YAML]_ markup format, with the same structure and
12219semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
12220:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
12221
12222This directive is terminated by an ``.end_amdgpu_metadata`` directive.
12223
12224.. _amdgpu-amdhsa-assembler-example-v3-v4:
12225
12226Code Object V3 to V4 Example Source Code
12227~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12228
12229Here is an example of a minimal assembly source file, defining one HSA kernel:
12230
12231.. code::
12232   :number-lines:
12233
12234   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12235
12236   .text
12237   .globl hello_world
12238   .p2align 8
12239   .type hello_world,@function
12240   hello_world:
12241     s_load_dwordx2 s[0:1], s[0:1], 0x0
12242     v_mov_b32 v0, 3.14159
12243     s_waitcnt lgkmcnt(0)
12244     v_mov_b32 v1, s0
12245     v_mov_b32 v2, s1
12246     flat_store_dword v[1:2], v0
12247     s_endpgm
12248   .Lfunc_end0:
12249     .size   hello_world, .Lfunc_end0-hello_world
12250
12251   .rodata
12252   .p2align 6
12253   .amdhsa_kernel hello_world
12254     .amdhsa_user_sgpr_kernarg_segment_ptr 1
12255     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12256     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12257   .end_amdhsa_kernel
12258
12259   .amdgpu_metadata
12260   ---
12261   amdhsa.version:
12262     - 1
12263     - 0
12264   amdhsa.kernels:
12265     - .name: hello_world
12266       .symbol: hello_world.kd
12267       .kernarg_segment_size: 48
12268       .group_segment_fixed_size: 0
12269       .private_segment_fixed_size: 0
12270       .kernarg_segment_align: 4
12271       .wavefront_size: 64
12272       .sgpr_count: 2
12273       .vgpr_count: 3
12274       .max_flat_workgroup_size: 256
12275       .args:
12276         - .size: 8
12277           .offset: 0
12278           .value_kind: global_buffer
12279           .address_space: global
12280           .actual_access: write_only
12281   //...
12282   .end_amdgpu_metadata
12283
12284This kernel is equivalent to the following HIP program:
12285
12286.. code::
12287   :number-lines:
12288
12289   __global__ void hello_world(float *p) {
12290       *p = 3.14159f;
12291   }
12292
12293If an assembly source file contains multiple kernels and/or functions, the
12294:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
12295:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
12296the ``.set <symbol>, <expression>`` directive. For example, in the case of two
12297kernels, where ``func1`` is only called from ``kern1``, it is sufficient
12298to group the function with the kernel that calls it and reset the symbols
12299between the two connected components:
12300
12301.. code::
12302   :number-lines:
12303
12304   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12305
12306   // gpr tracking symbols are implicitly set to zero
12307
12308   .text
12309   .globl kern0
12310   .p2align 8
12311   .type kern0,@function
12312   kern0:
12313     // ...
12314     s_endpgm
12315   .Lkern0_end:
12316     .size   kern0, .Lkern0_end-kern0
12317
12318   .rodata
12319   .p2align 6
12320   .amdhsa_kernel kern0
12321     // ...
12322     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12323     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12324   .end_amdhsa_kernel
12325
12326   // reset symbols to begin tracking usage in func1 and kern1
12327   .set .amdgcn.next_free_vgpr, 0
12328   .set .amdgcn.next_free_sgpr, 0
12329
12330   .text
12331   .hidden func1
12332   .global func1
12333   .p2align 2
12334   .type func1,@function
12335   func1:
12336     // ...
12337     s_setpc_b64 s[30:31]
12338   .Lfunc1_end:
12339     .size   func1, .Lfunc1_end-func1
12340
12341   .globl kern1
12342   .p2align 8
12343   .type kern1,@function
12344   kern1:
12345     // ...
12346     s_getpc_b64 s[4:5]
12347     s_add_u32 s4, s4, func1@rel32@lo+4
12348     s_addc_u32 s5, s5, func1@rel32@hi+12
12349     s_swappc_b64 s[30:31], s[4:5]
12350     // ...
12351     s_endpgm
12352   .Lkern1_end:
12353     .size   kern1, .Lkern1_end-kern1
12354
12355   .rodata
12356   .p2align 6
12357   .amdhsa_kernel kern1
12358     // ...
12359     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12360     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12361   .end_amdhsa_kernel
12362
12363These symbols do not identify connected components, so they cannot
12364automatically track register usage for each kernel. However, in some cases
12365careful organization of the kernels and functions in the source file keeps the
12366additional effort required to accurately calculate GPR usage to a minimum.
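
When a function is reachable from more than one kernel, resetting the symbols
between groups no longer separates the register usage cleanly. One option,
sketched here under stated assumptions, is to pass explicit values to
``.amdhsa_next_free_vgpr`` and ``.amdhsa_next_free_sgpr`` instead of the
tracking symbols. The kernel name ``kern2`` and both register counts are
hypothetical and would have to be derived by hand for a real kernel:

.. code::

   // Assumption: kern2 and every function it can call reference no VGPR
   // above v11 and no SGPR above s31 (s[30:31] holds the return address).
   .rodata
   .p2align 6
   .amdhsa_kernel kern2
     // ...
     .amdhsa_next_free_vgpr 12   // maximum VGPR number referenced (11), plus one
     .amdhsa_next_free_sgpr 32   // maximum SGPR number referenced (31), plus one
   .end_amdhsa_kernel

The trade-off is that explicit values must be updated by hand whenever the
kernel or any function it calls changes its register usage.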
12367
12368Additional Documentation
12369========================
12370
12371.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
12372.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
12373.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
12374.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
12375.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
12376.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
12377.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
12378.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
12379.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
12380.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
12381.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
12382.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
12383.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
12384.. [AMD-ROCm-github] `AMD ROCm™ GitHub <http://github.com/RadeonOpenCompute>`__
12385.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
12386.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
12387.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
12388.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
12389.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
12390.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
12391.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
12392.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
12393.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
12394.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
12395