=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================
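
As a minimal sketch of how the four tables above combine, the following Python
helper assembles a triple string and rejects components that the tables do not
list. The names ``make_triple``, ``ARCHITECTURES``, ``VENDORS``,
``OPERATING_SYSTEMS`` and ``ENVIRONMENTS`` are hypothetical and exist only for
this illustration. Omitting trailing empty components is also an assumption of
the sketch, not a documented Clang rule.

```python
# Components transcribed from the tables above; "" stands for the empty entry.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}
ENVIRONMENTS = {""}


def make_triple(arch, vendor, os_name="", environment=""):
    """Return <Architecture>-<Vendor>-<OS>-<Environment> for listed components."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"architecture not listed: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"vendor not listed: {vendor!r}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"OS not listed: {os_name!r}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"environment not listed: {environment!r}")
    # The vendor table documents "mesa3d" only for the "mesa3d" OS.
    if vendor == "mesa3d" and os_name != "mesa3d":
        raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
    # Assumption of this sketch: trailing empty components are omitted.
    return "-".join([arch, vendor, os_name, environment]).rstrip("-")


print(make_triple("amdgcn", "amd", "amdhsa"))  # prints amdgcn-amd-amdhsa
```

The resulting string is what would follow ``-target`` on the Clang command
line, for example ``-target amdgcn-amd-amdhsa``.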

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target-specific information.
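
The following Python sketch shows one way to format these options. It assumes
the target ID spelling ``processor[:feature(+|-)]...`` described in
:ref:`amdgpu-target-id`, which is not defined in this section, and it emits
features in alphabetical order as a further assumption. The helper names
``make_target_id``, ``mcpu_option`` and ``offload_arch_option`` are
hypothetical.

```python
def make_target_id(processor, features=None):
    """Format a target ID such as gfx908:sramecc+:xnack- (assumed spelling)."""
    parts = [processor]
    # features maps a feature name to True (on, "+") or False (off, "-").
    for name in sorted(features or {}):
        parts.append(f"{name}{'+' if features[name] else '-'}")
    return ":".join(parts)


def mcpu_option(target_id):
    """Return the -mcpu form of the option."""
    return f"-mcpu={target_id}"


def offload_arch_option(target_id):
    """Return the --offload-arch form of the option."""
    return f"--offload-arch={target_id}"


target_id = make_target_id("gfx908", {"xnack": False, "sramecc": True})
print(mcpu_option(target_id))          # prints -mcpu=gfx908:sramecc+:xnack-
print(offload_arch_option("gfx906"))   # prints --offload-arch=gfx906
```

Only features listed for a processor in the *Target Features Supported* column
of the processor table are meaningful here; the sketch does not check that.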

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exception:

* ``amdhsa`` is not supported by the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).
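
The rule above can be sketched as a small check. ``PROCESSOR_ARCH`` is a
hypothetical excerpt of the processor table (processor to triple
architecture), and ``abi_supported`` is an illustrative name. The sketch
models only the stated exception, not the per-processor *OS Support* column.

```python
# Excerpt of the processor table: processor name -> target triple architecture.
PROCESSOR_ARCH = {
    "cedar": "r600",
    "rv770": "r600",
    "gfx600": "amdgcn",
    "gfx906": "amdgcn",
}


def abi_supported(processor, os_name):
    """Every OS ABI is supported except amdhsa on the r600 architecture."""
    arch = PROCESSOR_ARCH[processor]
    return not (arch == "r600" and os_name == "amdhsa")


print(abi_supported("cedar", "amdhsa"))   # prints False
print(abi_supported("gfx906", "amdhsa"))  # prints True
```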


  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.
418
419     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
420                                                    - wavefrontsize64   flat
421                                                                        scratch                       .. TODO::
422
423                                                                                                        Add product
424                                                                                                        names.
425     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
426                                                    - wavefrontsize64   flat
427                                                                        scratch                       .. TODO::
428
429                                                                                                        Add product
430                                                                                                        names.
431
432     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
433                                                    - wavefrontsize64   flat
434                                                                        scratch                       .. TODO::
435                                                                                                        Add product
436                                                                                                        names.
437
438     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
439                                                    - wavefrontsize64   flat
440                                                                        scratch                       .. TODO::
441
442                                                                                                        Add product
443                                                                                                        names.
444
445     =========== =============== ============ ===== ================= =============== =============== ======================
446
447.. _amdgpu-target-features:
448
449Target Features
450---------------
451
452Target features control how code is generated to support certain
453processor specific features. Not all target features are supported by
454all processors. The runtime must ensure that the features supported by
455the device used to execute the code match the features enabled when
456generating the code. A mismatch of features may result in incorrect
457execution, or a reduction in performance.
458
The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
461
462Target features are controlled by exactly one of the following Clang
463options:
464
465``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
466
  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.
470
471``-m[no-]<target-feature>``
472
473  Target features not specified by the target ID are specified using a
474  separate option. These target features can have an ``on`` or ``off``
475  value.  ``on`` is specified by omitting the ``no-`` prefix, and
476  ``off`` is specified by including the ``no-`` prefix. The default
477  if not specified is ``off``.
478
479For example:
480
481``-mcpu=gfx908:xnack+``
482  Enable the ``xnack`` feature.
483``-mcpu=gfx908:xnack-``
484  Disable the ``xnack`` feature.
485``-mcumode``
486  Enable the ``cumode`` feature.
487``-mno-cumode``
488  Disable the ``cumode`` feature.
489
490  .. table:: AMDGPU Target Features
491     :name: amdgpu-target-features-table
492
493     =============== ============================ ==================================================
494     Target Feature  Clang Option to Control      Description
495     Name
496     =============== ============================ ==================================================
497     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled,
                                                  native WGP wavefront execution mode is used;
                                                  when enabled, CU wavefront execution mode is used
501                                                  (see :ref:`amdgpu-amdhsa-memory-model`).
502
503     sramecc         - ``-mcpu``                  If specified, generate code that can only be
504                     - ``--offload-arch``         loaded and executed in a process that has a
505                                                  matching setting for SRAMECC.
506
507                                                  If not specified for code object V2 to V3, generate
508                                                  code that can be loaded and executed in a process
509                                                  with SRAMECC enabled.
510
511                                                  If not specified for code object V4 or above, generate
512                                                  code that can be loaded and executed in a process
513                                                  with either setting of SRAMECC.
514
     tgsplit         - ``-m[no-]tgsplit``         Enable/disable generating code that assumes
516                                                  work-groups are launched in threadgroup split mode.
517                                                  When enabled the waves of a work-group may be
518                                                  launched in different CUs.
519
520     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled,
                                                  native wavefront size 32 is used; when enabled,
                                                  wavefront size 64 is used.
524
525     xnack           - ``-mcpu``                  If specified, generate code that can only be
526                     - ``--offload-arch``         loaded and executed in a process that has a
527                                                  matching setting for XNACK replay.
528
529                                                  If not specified for code object V2 to V3, generate
530                                                  code that can be loaded and executed in a process
531                                                  with XNACK replay enabled.
532
533                                                  If not specified for code object V4 or above, generate
534                                                  code that can be loaded and executed in a process
535                                                  with either setting of XNACK replay.
536
537                                                  XNACK replay can be used for demand paging and
538                                                  page migration. If enabled in the device, then if
539                                                  a page fault occurs the code may execute
540                                                  incorrectly unless generated with XNACK replay
541                                                  enabled, or generated for code object V4 or above without
542                                                  specifying XNACK replay. Executing code that was
543                                                  generated with XNACK replay enabled, or generated
544                                                  for code object V4 or above without specifying XNACK replay,
545                                                  on a device that does not have XNACK replay
546                                                  enabled will execute correctly but may be less
547                                                  performant than code generated for XNACK replay
548                                                  disabled.
549     =============== ============================ ==================================================
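
The split between the two kinds of option can be sketched in Python. This is a
minimal illustration and not part of Clang: ``clang_feature_options`` is a
hypothetical helper, the two feature sets are copied from
:ref:`amdgpu-target-features-table`, and the sketch does not check that the
processor actually supports each feature. A feature left out of ``settings``
is not emitted at all.

.. code:: python

  # Features the table lists as controlled through the target ID.
  TARGET_ID_FEATURES = {"sramecc", "xnack"}
  # Features the table lists as controlled by a separate -m[no-] option.
  FLAG_FEATURES = {"cumode", "tgsplit", "wavefrontsize64"}


  def clang_feature_options(processor, settings):
      """Return Clang options for ``settings`` (feature name -> bool).

      Hypothetical helper for illustration only.
      """
      target_id = processor
      flags = []
      for name in sorted(settings):
          enabled = settings[name]
          if name in TARGET_ID_FEATURES:
              # Target ID features are appended as ":<name>+" or ":<name>-".
              target_id += ":" + name + ("+" if enabled else "-")
          elif name in FLAG_FEATURES:
              # Other features use -m<name> (on) or -mno-<name> (off).
              flags.append("-m" + ("" if enabled else "no-") + name)
          else:
              raise ValueError("unknown target feature: " + name)
      return ["-mcpu=" + target_id] + flags


  print(clang_feature_options("gfx1010", {"xnack": False, "cumode": True}))

For the call shown, the helper returns ``-mcpu=gfx1010:xnack-`` followed by
``-mcumode``.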
550
551.. _amdgpu-target-id:
552
553Target ID
554---------
555
556AMDGPU supports target IDs. See `Clang Offload Bundler
557<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
558description. The AMDGPU target specific information is:
559
560**processor**
561  Is an AMDGPU processor or alternative processor name specified in
562  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
563  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.
565
566**target-feature**
567  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
570  a target ID are marked as being controlled by ``-mcpu`` and
571  ``--offload-arch``. Each target feature must appear at most once in a target
572  ID. The non-canonical form target ID allows the target features to be
573  specified in any order. The canonical form target ID requires the target
574  features to be specified in alphabetic order.
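
As a rough illustration of the ordering rule, the following Python sketch
rewrites a target ID of the form used in the ``-mcpu=gfx908:xnack+`` example so
that its features are in alphabetic order. ``canonicalize_target_id`` is a
hypothetical name, and the sketch deliberately does not map alternative
processor names to the primary name, since that needs the contents of
:ref:`amdgpu-processor-table`.

.. code:: python

  def canonicalize_target_id(target_id):
      """Sort the features of a ``processor(:feature[+-])*`` target ID.

      Illustrative only: the processor name is passed through unchanged.
      """
      processor, *features = target_id.split(":")
      seen = set()
      for feature in features:
          name, setting = feature[:-1], feature[-1:]
          if not name or setting not in ("+", "-"):
              raise ValueError("malformed target feature: " + repr(feature))
          if name in seen:
              # Each target feature may appear at most once.
              raise ValueError("target feature repeated: " + name)
          seen.add(name)
      # Canonical form lists the features in alphabetic order by name.
      ordered = sorted(features, key=lambda feature: feature[:-1])
      return ":".join([processor] + ordered)


  print(canonicalize_target_id("gfx908:xnack+:sramecc-"))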
575
576.. _amdgpu-target-id-v2-v3:
577
578Code Object V2 to V3 Target ID
579~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
580
581The target ID syntax for code object V2 to V3 is the same as defined in `Clang
582Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
583when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
584directive and the bundle entry ID. In those cases it has the following BNF
585syntax:
586
587.. code::
588
589  <target-id> ::== <processor> ( "+" <target-feature> )*
590
591Where a target feature is omitted if *Off* and present if *On* or *Any*.
592
593.. note::
594
  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.
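
The omit-if-*Off* rule can be sketched as follows. This is a minimal Python
illustration under stated assumptions: ``v2_v3_target_id`` is a hypothetical
helper, and emitting the features in alphabetic order is a choice made for the
sketch, not something the syntax above requires.

.. code:: python

  def v2_v3_target_id(processor, features):
      """Format a code object V2 to V3 target ID.

      ``features`` maps a feature name to "on", "off" or "any".
      """
      parts = [processor]
      for name in sorted(features):  # ordering is an assumption of this sketch
          setting = features[name]
          if setting not in ("on", "off", "any"):
              raise ValueError("bad setting for " + name + ": " + repr(setting))
          if setting != "off":
              # Present if On or Any, omitted if Off.
              parts.append(name)
      return "+".join(parts)


  print(v2_v3_target_id("gfx908", {"xnack": "on", "sramecc": "off"}))

Because *Any* and *On* produce the same string, the original setting cannot be
recovered from a V2 to V3 target ID.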
597
598.. _amdgpu-embedding-bundled-objects:
599
600Embedding Bundled Code Objects
601------------------------------
602
603AMDGPU supports the HIP and OpenMP languages that perform code object embedding
604as described in `Clang Offload Bundler
605<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
606
607.. note::
608
609  The target ID syntax used for code object V2 to V3 for a bundle entry ID
610  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
611
612.. _amdgpu-address-spaces:
613
614Address Spaces
615--------------
616
617The AMDGPU architecture supports a number of memory address spaces. The address
618space names use the OpenCL standard names, with some additions.
619
620The AMDGPU address spaces correspond to target architecture specific LLVM
621address space numbers used in LLVM IR.
622
623The AMDGPU address spaces are described in
624:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
625supported for the ``amdgcn`` target.
626
627  .. table:: AMDGPU Address Spaces
628     :name: amdgpu-address-spaces-table
629
630     ================================= =============== =========== ================ ======= ============================
631     ..                                                                                     64-Bit Process Address Space
632     --------------------------------- --------------- ----------- ---------------- ------------------------------------
633     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
634                                       Space Number    Name        Name             Size
635     ================================= =============== =========== ================ ======= ============================
636     Generic                           0               flat        flat             64      0x0000000000000000
637     Global                            1               global      global           64      0x0000000000000000
638     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
639     Local                             3               group       LDS              32      0xFFFFFFFF
640     Constant                          4               constant    *same as global* 64      0x0000000000000000
641     Private                           5               private     scratch          32      0xFFFFFFFF
642     Constant 32-bit                   6               *TODO*                               0x00000000
643     Buffer Fat Pointer (experimental) 7               *TODO*
644     ================================= =============== =========== ================ ======= ============================
645
646**Generic**
647  The generic address space is supported unless the *Target Properties* column
648  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
649  space*.
650
651  The generic address space uses the hardware flat address support for two fixed
652  ranges of virtual addresses (the private and local apertures), that are
653  outside the range of addressable global memory, to map from a flat address to
654  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.
657
658  Flat access to scratch requires hardware aperture setup and setup in the
659  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
660  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
661  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
662
  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
668  GFX9-GFX10 the aperture base addresses are directly available as inline
669  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
670  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
671  aligned to 2^32 which makes it easier to convert from flat to segment or
672  segment to flat.
673
674  A global address space address has the same value when used as a flat address
675  so no conversion is needed.
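
  Because the apertures are 2^32 bytes in size and aligned to 2^32, the
  conversion reduces to combining or masking the low 32 bits. The Python sketch
  below illustrates only that arithmetic. The function names and the example
  aperture base are made up for illustration, and the sketch does not handle
  the differing NULL values of the address spaces.

  .. code:: python

    APERTURE_SIZE = 1 << 32          # aperture size in 64-bit address mode
    OFFSET_MASK = APERTURE_SIZE - 1  # low 32 bits hold the segment address

    # Made-up aperture base for illustration. Real values come from the queue
    # (GFX7-GFX8) or from SRC_SHARED_BASE / SRC_PRIVATE_BASE (GFX9-GFX10).
    EXAMPLE_LOCAL_BASE = 0x0001000000000000


    def segment_to_flat(segment_address, aperture_base):
        """Place a 32-bit segment address inside the aperture."""
        if aperture_base & OFFSET_MASK:
            raise ValueError("aperture base must be aligned to 2^32")
        return aperture_base | (segment_address & OFFSET_MASK)


    def flat_to_segment(flat_address, aperture_base):
        """Recover the 32-bit segment address from a flat address."""
        if (flat_address >> 32) != (aperture_base >> 32):
            raise ValueError("flat address is outside the aperture")
        return flat_address & OFFSET_MASK


    flat = segment_to_flat(0x1234, EXAMPLE_LOCAL_BASE)
    print(hex(flat), hex(flat_to_segment(flat, EXAMPLE_LOCAL_BASE)))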
676
677**Global and Constant**
678  The global and constant address spaces both use global virtual addresses,
679  which are the same virtual address space used by the CPU. However, some
680  virtual addresses may only be accessible to the CPU, some only accessible
681  by the GPU, and some by both.
682
683  Using the constant address space indicates that the data will not change
684  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
689  of volatile data before each kernel dispatch execution to allow constant
690  memory to change values between kernel dispatches.
691
692**Region**
693  The region address space uses the hardware Global Data Store (GDS). All
694  wavefronts executing on the same device will access the same memory for any
695  given region address. However, the same region address accessed by wavefronts
696  executing on different devices will access different memory. It is higher
697  performance than global memory. It is allocated by the runtime. The data
698  store (DS) instructions can be used to access it.
699
700**Local**
701  The local address space uses the hardware Local Data Store (LDS) which is
702  automatically allocated when the hardware creates the wavefronts of a
703  work-group, and freed when all the wavefronts of a work-group have
704  terminated. All wavefronts belonging to the same work-group will access the
705  same memory for any given local address. However, the same local address
706  accessed by wavefronts belonging to different work-groups will access
707  different memory. It is higher performance than global memory. The data store
708  (DS) instructions can be used to access it.
709
710**Private**
711  The private address space uses the hardware scratch memory support which
712  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
714  given private address will be different to the memory accessed by another lane
715  of the same or different wavefront for the same private address.
716
717  If a kernel dispatch uses scratch, then the hardware allocates memory from a
718  pool of backing memory allocated by the runtime for each wavefront. The lanes
719  of the wavefront access this using dword (4 byte) interleaving. The mapping
720  used from private address to backing memory address is:
721
722    ``wavefront-scratch-base +
723    ((private-address / 4) * wavefront-size * 4) +
724    (wavefront-lane-id * 4) + (private-address % 4)``
725
726  If each lane of a wavefront accesses the same private address, the
727  interleaving results in adjacent dwords being accessed and hence requires
728  fewer cache lines to be fetched.
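
  The mapping above can be written out directly. The following Python sketch
  is only an illustration of the formula: ``private_to_backing`` is a
  hypothetical name, and the scratch base defaults to 0 so that the result
  reads as an offset within the wavefront's backing memory.

  .. code:: python

    def private_to_backing(private_address, lane_id, wavefront_size,
                           scratch_base=0):
        """Apply the dword-interleaved private address mapping."""
        # Split the private address into a dword index and a byte within it.
        dword_index, byte_in_dword = divmod(private_address, 4)
        return (scratch_base
                + dword_index * wavefront_size * 4  # one row of dwords per index
                + lane_id * 4                       # this lane's dword in the row
                + byte_in_dword)


    # Lanes 0 and 1 of a 64-wide wavefront reading private address 8
    # land on adjacent dwords of backing memory.
    print(private_to_backing(8, 0, 64), private_to_backing(8, 1, 64))

  With a wavefront size of 64, private address 8 maps to offset 512 for lane 0
  and offset 516 for lane 1, which shows the adjacency described above.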
729
730  There are different ways that the wavefront scratch base address is
731  determined by a wavefront (see
732  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
733
734  Scratch memory can be accessed in an interleaved manner using buffer
735  instructions with the scratch buffer descriptor and per wavefront scratch
736  offset, by the scratch instructions, or by flat instructions. Multi-dword
737  access is not supported except by flat and scratch instructions in
738  GFX9-GFX10.
739
740**Constant 32-bit**
741  *TODO*
742
743**Buffer Fat Pointer**
744  The buffer fat pointer is an experimental address space that is currently
745  unsupported in the backend. It exposes a non-integral pointer that is in
746  the future intended to support the modelling of 128-bit buffer descriptors
747  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
748  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
749  model the buffer descriptors used heavily in graphics workloads targeting
750  the backend.
751
752.. _amdgpu-memory-scopes:
753
754Memory Scopes
755-------------
756
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
760
761The memory model supported is based on the HSA memory model [HSA]_ which is
762based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
763relation is transitive over the synchronizes-with relation independent of scope
764and synchronizes-with allows the memory scope instances to be inclusive (see
765table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
766
767This is different to the OpenCL [OpenCL]_ memory model which does not have scope
768inclusion and requires the memory scopes to exactly match. However, this
769is conservatively correct for OpenCL.
770
771  .. table:: AMDHSA LLVM Sync Scopes
772     :name: amdgpu-amdhsa-llvm-sync-scopes-table
773
774     ======================= ===================================================
775     LLVM Sync Scope         Description
776     ======================= ===================================================
777     *none*                  The default: ``system``.
778
779                             Synchronizes with, and participates in modification
780                             and seq_cst total orderings with, other operations
781                             (except image operations) for all address spaces
782                             (except private, or generic that accesses private)
783                             provided the other operation's sync scope is:
784
785                             - ``system``.
786                             - ``agent`` and executed by a thread on the same
787                               agent.
788                             - ``workgroup`` and executed by a thread in the
789                               same work-group.
790                             - ``wavefront`` and executed by a thread in the
791                               same wavefront.
792
793     ``agent``               Synchronizes with, and participates in modification
794                             and seq_cst total orderings with, other operations
795                             (except image operations) for all address spaces
796                             (except private, or generic that accesses private)
797                             provided the other operation's sync scope is:
798
799                             - ``system`` or ``agent`` and executed by a thread
800                               on the same agent.
801                             - ``workgroup`` and executed by a thread in the
802                               same work-group.
803                             - ``wavefront`` and executed by a thread in the
804                               same wavefront.
805
806     ``workgroup``           Synchronizes with, and participates in modification
807                             and seq_cst total orderings with, other operations
808                             (except image operations) for all address spaces
809                             (except private, or generic that accesses private)
810                             provided the other operation's sync scope is:
811
812                             - ``system``, ``agent`` or ``workgroup`` and
813                               executed by a thread in the same work-group.
814                             - ``wavefront`` and executed by a thread in the
815                               same wavefront.
816
817     ``wavefront``           Synchronizes with, and participates in modification
818                             and seq_cst total orderings with, other operations
819                             (except image operations) for all address spaces
820                             (except private, or generic that accesses private)
821                             provided the other operation's sync scope is:
822
823                             - ``system``, ``agent``, ``workgroup`` or
824                               ``wavefront`` and executed by a thread in the
825                               same wavefront.
826
827     ``singlethread``        Only synchronizes with and participates in
828                             modification and seq_cst total orderings with,
829                             other operations (except image operations) running
830                             in the same thread for all address spaces (for
831                             example, in signal handlers).
832
833     ``one-as``              Same as ``system`` but only synchronizes with other
834                             operations within the same address space.
835
836     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
837                             operations within the same address space.
838
839     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
840                             other operations within the same address space.
841
842     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
843                             other operations within the same address space.
844
845     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
846                             other operations within the same address space.
847     ======================= ===================================================
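
One way to read the first five rows of the table is that two operations can
synchronize only when both threads are in the same instance of the narrower of
the two scopes. The Python sketch below encodes that reading. It is an
interpretation for illustration rather than a definition of the memory model:
the ``Thread`` tuple and ``scopes_include`` are hypothetical names, the default
(*none*) scope is written as ``system``, and the ``one-as`` variants, image
operations, and the private address space exceptions are not modelled.

.. code:: python

  from collections import namedtuple

  # Scopes ordered from widest to narrowest.
  SCOPES = ["system", "agent", "workgroup", "wavefront", "singlethread"]

  # Hypothetical description of where a thread runs. Each field is only
  # meaningful within the fields before it.
  Thread = namedtuple("Thread", ["agent", "workgroup", "wavefront", "lane"])


  def scopes_include(thread_a, scope_a, thread_b, scope_b):
      """Return True if the two operations' scope instances are inclusive."""
      # Index of the narrower scope; system is 0 and needs no shared instance.
      narrowest = max(SCOPES.index(scope_a), SCOPES.index(scope_b))
      # Both threads must share every level down to the narrower scope.
      return thread_a[:narrowest] == thread_b[:narrowest]


  a = Thread(agent=0, workgroup=0, wavefront=0, lane=0)
  b = Thread(agent=0, workgroup=0, wavefront=1, lane=5)
  print(scopes_include(a, "workgroup", b, "system"))
  print(scopes_include(a, "wavefront", b, "workgroup"))

Here ``a`` and ``b`` share a work-group but not a wavefront, so the first call
reports that the scopes are inclusive and the second reports that they are not.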
848
849LLVM IR Intrinsics
850------------------
851
852The AMDGPU backend implements the following LLVM IR intrinsics.
853
854*This section is WIP.*
855
856.. TODO::
857
858   List AMDGPU intrinsics.
859
860LLVM IR Attributes
861------------------
862
863The AMDGPU backend supports the following LLVM IR attributes.
864
865  .. table:: AMDGPU LLVM IR Attributes
866     :name: amdgpu-llvm-ir-attributes-table
867
868     ======================================= ==========================================================
869     LLVM Attribute                          Description
870     ======================================= ==========================================================
871     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
872                                             will be specified when the kernel is dispatched. Generated
873                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
874                                             The implied default value is 1,1024.
875
876     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
877                                             argument block size for the implicit arguments. This
878                                             varies by OS and language (for OpenCL see
879                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
880     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
881                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
882     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
883                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
884     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
885                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
886                                             CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
887                                             and the backend may not be able to satisfy the request. If
888                                             the specified range is incompatible with the function's
                                             "amdgpu-flat-work-group-size" value, the occupancy bounds
                                             implied by the workgroup size take precedence.
891
892     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
893                                             mode register to be set on entry. Overrides the default for
894                                             the calling convention.
895     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
896                                             the mode register to be set on entry. Overrides the default
897                                             for the calling convention.
898
899     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
900                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
901                                             attribute, or reached through a call site marked with this attribute,
902                                             the value returned by the intrinsic is undefined. The backend can
903                                             generally infer this during code generation, so typically there is no
904                                             benefit to frontends marking functions with this.
905
906     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
907                                             llvm.amdgcn.workitem.id.y intrinsic.
908
909     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
910                                             llvm.amdgcn.workitem.id.z intrinsic.
911
912     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
913                                             llvm.amdgcn.workgroup.id.x intrinsic.
914
915     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
916                                             llvm.amdgcn.workgroup.id.y intrinsic.
917
918     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
919                                             llvm.amdgcn.workgroup.id.z intrinsic.
920
921     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
922                                             llvm.amdgcn.dispatch.ptr intrinsic.
923
924     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
925                                             llvm.amdgcn.implicitarg.ptr intrinsic.
926
927     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
928                                             llvm.amdgcn.dispatch.id intrinsic.
929
930     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
931                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
932                                             attributes, the queue pointer may be required in situations where the
933                                             intrinsic call does not directly appear in the program. Some subtargets
                                             require the queue pointer to handle some addrspacecasts, as well
935                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
936                                             llvm.debug intrinsics.
937
938     "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
939                                             kernel argument that holds the pointer to the hostcall buffer. If this
940                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
941
942     "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
943                                             kernel argument that holds the pointer to an initialized memory buffer
944                                             that conforms to the requirements of the malloc/free device library V1
945                                             version implementation. If this attribute is absent, then the
946                                             amdgpu-no-implicitarg-ptr is also removed.
947
948     ======================================= ==========================================================
949
950.. _amdgpu-elf-code-object:
951
952ELF Code Object
953===============
954
955The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
956can be linked by ``lld`` to produce a standard ELF shared code object which can
957be loaded and executed on an AMDGPU target.
958
959.. _amdgpu-elf-header:
960
961Header
962------
963
964The AMDGPU backend uses the following ELF header:
965
966  .. table:: AMDGPU ELF Header
967     :name: amdgpu-elf-header-table
968
969     ========================== ===============================
970     Field                      Value
971     ========================== ===============================
972     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
973     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
974     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
975                                - ``ELFOSABI_AMDGPU_HSA``
976                                - ``ELFOSABI_AMDGPU_PAL``
977                                - ``ELFOSABI_AMDGPU_MESA3D``
978     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
979                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
980                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
981                                - ``ELFABIVERSION_AMDGPU_HSA_V5``
982                                - ``ELFABIVERSION_AMDGPU_PAL``
983                                - ``ELFABIVERSION_AMDGPU_MESA3D``
984     ``e_type``                 - ``ET_REL``
985                                - ``ET_DYN``
986     ``e_machine``              ``EM_AMDGPU``
987     ``e_entry``                0
988     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
989                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
990                                and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
991     ========================== ===============================
992
993..
994
995  .. table:: AMDGPU ELF Header Enumeration Values
996     :name: amdgpu-elf-header-enumeration-values-table
997
998     =============================== =====
999     Name                            Value
1000     =============================== =====
1001     ``EM_AMDGPU``                   224
1002     ``ELFOSABI_NONE``               0
1003     ``ELFOSABI_AMDGPU_HSA``         64
1004     ``ELFOSABI_AMDGPU_PAL``         65
1005     ``ELFOSABI_AMDGPU_MESA3D``      66
1006     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1007     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1008     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1009     ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1010     ``ELFABIVERSION_AMDGPU_PAL``    0
1011     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1012     =============================== =====
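
As a rough illustration of how the header fields and enumeration values above
fit together, the following sketch checks whether a byte buffer begins with a
little-endian ELF header whose ``e_machine`` is ``EM_AMDGPU``. The function
name ``amdgpu_is_code_object``, the constant names, and the hand-written
offsets (taken from the generic ELF layout) are assumptions made for this
example and are not part of any LLVM or runtime API.

```c
#include <stddef.h>
#include <stdint.h>

/* Value from the enumeration table above. */
enum { AMDGPU_EM_VALUE = 224 };

/* Generic ELF header layout (illustrative names, values from the ELF spec). */
enum {
  IDX_EI_CLASS = 4,     /* index of EI_CLASS in e_ident */
  IDX_EI_DATA = 5,      /* index of EI_DATA in e_ident */
  VAL_ELFCLASS32 = 1,
  VAL_ELFCLASS64 = 2,
  VAL_ELFDATA2LSB = 1,
  OFF_E_MACHINE = 18    /* e_ident (16 bytes) + e_type (2 bytes) */
};

/* Returns 1 if buf looks like an AMDGPU ELF code object header, else 0. */
static int amdgpu_is_code_object(const uint8_t *buf, size_t size) {
  if (size < OFF_E_MACHINE + 2)
    return 0;
  if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
    return 0;
  if (buf[IDX_EI_CLASS] != VAL_ELFCLASS32 &&
      buf[IDX_EI_CLASS] != VAL_ELFCLASS64)
    return 0;
  if (buf[IDX_EI_DATA] != VAL_ELFDATA2LSB)
    return 0;
  /* e_machine is a 16-bit little-endian field. */
  uint16_t machine =
      (uint16_t)(buf[OFF_E_MACHINE] | (buf[OFF_E_MACHINE + 1] << 8));
  return machine == AMDGPU_EM_VALUE;
}
```

A caller would typically pass the first bytes read from a code object file.
The check deliberately ignores ``e_ident[EI_OSABI]``, since any of the listed
OS ABI values may appear.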
1013
1014``e_ident[EI_CLASS]``
1015  The ELF class is:
1016
1017  * ``ELFCLASS32`` for ``r600`` architecture.
1018
  * ``ELFCLASS64`` for ``amdgcn`` architecture, which only supports 64-bit
    process address space applications.
1021
1022``e_ident[EI_DATA]``
1023  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1024
1025``e_ident[EI_OSABI]``
1026  One of the following AMDGPU target architecture specific OS ABIs
1027  (see :ref:`amdgpu-os`):
1028
1029  * ``ELFOSABI_NONE`` for *unknown* OS.
1030
1031  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1032
1033  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1034
1035  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
1036
1037``e_ident[EI_ABIVERSION]``
1038  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1039  object conforms:
1040
1041  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1042    runtime ABI for code object V2. Specify using the Clang option
1043    ``-mcode-object-version=2``.
1044
1045  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1046    runtime ABI for code object V3. Specify using the Clang option
1047    ``-mcode-object-version=3``.
1048
1049  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1050    runtime ABI for code object V4. Specify using the Clang option
1051    ``-mcode-object-version=4``. This is the default code object
1052    version if not specified.
1053
1054  * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1055    runtime ABI for code object V5. Specify using the Clang option
1056    ``-mcode-object-version=5``.
1057
1058  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1059    runtime ABI.
1060
1061  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1062    3D runtime ABI.
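
The relationship between ``e_ident[EI_ABIVERSION]`` and the code object
version for the ``amdhsa`` OS can be sketched as a small lookup. The helper
name ``amdgpu_hsa_code_object_version`` is hypothetical; the numeric values
are those listed in :ref:`amdgpu-elf-header-enumeration-values-table`.

```c
#include <stdint.h>

/* Returns the code object version (2 to 5) for an amdhsa code object, or -1
   if the OS ABI is not ELFOSABI_AMDGPU_HSA or the ABI version is not one of
   the values documented here. */
static int amdgpu_hsa_code_object_version(uint8_t osabi, uint8_t abiversion) {
  if (osabi != 64) /* ELFOSABI_AMDGPU_HSA */
    return -1;
  switch (abiversion) {
  case 0: return 2; /* ELFABIVERSION_AMDGPU_HSA_V2 */
  case 1: return 3; /* ELFABIVERSION_AMDGPU_HSA_V3 */
  case 2: return 4; /* ELFABIVERSION_AMDGPU_HSA_V4 */
  case 3: return 5; /* ELFABIVERSION_AMDGPU_HSA_V5 */
  default: return -1;
  }
}
```

The OS ABI check matters because ``ELFABIVERSION_AMDGPU_PAL`` and
``ELFABIVERSION_AMDGPU_MESA3D`` share the value 0 with
``ELFABIVERSION_AMDGPU_HSA_V2``.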
1063
1064``e_type``
1065  Can be one of the following values:
1066
1067
1068  ``ET_REL``
    The type produced by the AMDGPU backend compiler as it is a relocatable code
    object.
1071
1072  ``ET_DYN``
1073    The type produced by the linker as it is a shared code object.
1074
  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1076
1077``e_machine``
1078  The value ``EM_AMDGPU`` is used for the machine for all processors supported
1079  by the ``r600`` and ``amdgcn`` architectures (see
1080  :ref:`amdgpu-processor-table`). The specific processor is specified in the
1081  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1082  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1083  ``e_flags`` for code object V3 and above (see
1084  :ref:`amdgpu-elf-header-e_flags-table-v3` and
1085  :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1086
1087``e_entry``
1088  The entry point is 0 as the entry points for individual kernels must be
1089  selected in order to invoke them through AQL packets.
1090
1091``e_flags``
1092  The AMDGPU backend uses the following ELF header flags:
1093
1094  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1095     :name: amdgpu-elf-header-e_flags-v2-table
1096
1097     ===================================== ===== =============================
1098     Name                                  Value Description
1099     ===================================== ===== =============================
1100     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1101                                                 target feature is
1102                                                 enabled for all code
1103                                                 contained in the code object.
1104                                                 If the processor
1105                                                 does not support the
1106                                                 ``xnack`` target
1107                                                 feature then must
1108                                                 be 0.
1109                                                 See
1110                                                 :ref:`amdgpu-target-features`.
1111     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1112                                                 handler is enabled for all
1113                                                 code contained in the code
1114                                                 object. If the processor
1115                                                 does not support a trap
1116                                                 handler then must be 0.
1117                                                 See
1118                                                 :ref:`amdgpu-target-features`.
1119     ===================================== ===== =============================
1120
1121  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1122     :name: amdgpu-elf-header-e_flags-table-v3
1123
1124     ================================= ===== =============================
1125     Name                              Value Description
1126     ================================= ===== =============================
1127     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1128                                             mask for
1129                                             ``EF_AMDGPU_MACH_xxx`` values
1130                                             defined in
1131                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
1132     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1133                                             target feature is
1134                                             enabled for all code
1135                                             contained in the code object.
1136                                             If the processor
1137                                             does not support the
1138                                             ``xnack`` target
1139                                             feature then must
1140                                             be 0.
1141                                             See
1142                                             :ref:`amdgpu-target-features`.
1143     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1144                                             target feature is
1145                                             enabled for all code
1146                                             contained in the code object.
1147                                             If the processor
1148                                             does not support the
1149                                             ``sramecc`` target
1150                                             feature then must
1151                                             be 0.
1152                                             See
1153                                             :ref:`amdgpu-target-features`.
1154     ================================= ===== =============================
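
A minimal sketch of decoding the code object V3 ``e_flags`` layout from the
table above. The struct and function names are assumptions made for
illustration; only the mask values come from the table.

```c
#include <stdint.h>

/* Illustrative container for the decoded V3 fields. */
struct amdgpu_eflags_v3 {
  unsigned mach; /* EF_AMDGPU_MACH_xxx value */
  int xnack;     /* 1 if EF_AMDGPU_FEATURE_XNACK_V3 is set */
  int sramecc;   /* 1 if EF_AMDGPU_FEATURE_SRAMECC_V3 is set */
};

static struct amdgpu_eflags_v3 amdgpu_decode_eflags_v3(uint32_t e_flags) {
  struct amdgpu_eflags_v3 f;
  f.mach = e_flags & 0x0ffu;           /* EF_AMDGPU_MACH */
  f.xnack = (e_flags & 0x100u) != 0;   /* EF_AMDGPU_FEATURE_XNACK_V3 */
  f.sramecc = (e_flags & 0x200u) != 0; /* EF_AMDGPU_FEATURE_SRAMECC_V3 */
  return f;
}
```

For example, an ``e_flags`` value of ``0x12f`` decodes to processor value
``0x02f`` with ``xnack`` enabled and ``sramecc`` disabled.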
1155
1156  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1157     :name: amdgpu-elf-header-e_flags-table-v4-onwards
1158
1159     ============================================ ===== ===================================
     Name                                         Value Description
1161     ============================================ ===== ===================================
1162     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1163                                                        mask for
1164                                                        ``EF_AMDGPU_MACH_xxx`` values
1165                                                        defined in
1166                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
1167     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1168                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1169                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1171     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1172     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1173     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1174     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1175                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1176                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1180     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1181     ============================================ ===== ===================================
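
From code object V4 onwards the ``xnack`` and ``sramecc`` settings are
two-bit fields rather than single bits. The sketch below shows one way to
extract them; the function names and the returned strings are illustrative
assumptions, while the masks and values are those in the table above.

```c
#include <stdint.h>

/* Describes the XNACK setting selected by EF_AMDGPU_FEATURE_XNACK_V4. */
static const char *amdgpu_xnack_v4_name(uint32_t e_flags) {
  switch (e_flags & 0x300u) {
  case 0x000u: return "unsupported";
  case 0x100u: return "any";
  case 0x200u: return "off";
  default:     return "on"; /* 0x300 */
  }
}

/* Describes the SRAMECC setting selected by EF_AMDGPU_FEATURE_SRAMECC_V4. */
static const char *amdgpu_sramecc_v4_name(uint32_t e_flags) {
  switch (e_flags & 0xc00u) {
  case 0x000u: return "unsupported";
  case 0x400u: return "any";
  case 0x800u: return "off";
  default:     return "on"; /* 0xc00 */
  }
}
```

For instance, ``0x73f`` selects processor value ``0x03f`` with ``xnack``
enabled and ``sramecc`` allowed to have any value.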
1182
1183  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1184     :name: amdgpu-ef-amdgpu-mach-table
1185
1186     ==================================== ========== =============================
1187     Name                                 Value      Description (see
1188                                                     :ref:`amdgpu-processor-table`)
1189     ==================================== ========== =============================
1190     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1191     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1192     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1193     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1194     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1195     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1196     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1197     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1198     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1199     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1200     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1201     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1202     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1203     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1204     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1205     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1206     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1207     *reserved*                           0x011 -    Reserved for ``r600``
1208                                          0x01f      architecture processors.
1209     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1210     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1211     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1212     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1213     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1214     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1215     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1216     *reserved*                           0x027      Reserved.
1217     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1218     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1219     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1220     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1221     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1222     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1223     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1224     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1225     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1226     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1227     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1228     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1229     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1230     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1231     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1232     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1233     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1234     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1235     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1236     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1237     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1238     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1239     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1240     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1241     ``EF_AMDGPU_MACH_AMDGCN_GFX940``     0x040      ``gfx940``
1242     *reserved*                           0x041      Reserved.
1243     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1244     *reserved*                           0x043      Reserved.
1245     *reserved*                           0x044      Reserved.
1246     ``EF_AMDGPU_MACH_AMDGCN_GFX1036``    0x045      ``gfx1036``
1247     ==================================== ========== =============================
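
The value ranges in the table above can be used to classify an
``EF_AMDGPU_MACH`` value by architecture. This is only a sketch: the function
name and returned strings are assumptions, and the upper bound reflects the
values listed in this version of the table, which may grow as processors are
added.

```c
#include <stdint.h>

/* Classifies the EF_AMDGPU_MACH field of e_flags using the ranges in the
   table above. Values beyond the last listed entry are "unknown". */
static const char *amdgpu_mach_architecture(uint32_t e_flags) {
  uint32_t mach = e_flags & 0x0ffu; /* EF_AMDGPU_MACH */
  if (mach == 0x000u)
    return "none";
  if (mach <= 0x010u)
    return "r600";
  if (mach <= 0x01fu)
    return "reserved"; /* reserved for r600 architecture processors */
  if (mach == 0x027u || mach == 0x041u || mach == 0x043u || mach == 0x044u)
    return "reserved";
  if (mach <= 0x045u)
    return "amdgcn";
  return "unknown";
}
```

Because the mask is applied first, the function can be given a complete
``e_flags`` value from a code object V3 or later header.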
1248
1249Sections
1250--------
1251
1252An AMDGPU target ELF code object has the standard ELF sections which include:
1253
1254  .. table:: AMDGPU ELF Sections
1255     :name: amdgpu-elf-sections-table
1256
1257     ================== ================ =================================
1258     Name               Type             Attributes
1259     ================== ================ =================================
1260     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1261     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1262     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1263     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
     ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
1266     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1267     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1268     ``.note``          ``SHT_NOTE``     *none*
1269     ``.rela``\ *name*  ``SHT_RELA``     *none*
1270     ``.rela.dyn``      ``SHT_RELA``     *none*
1271     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1272     ``.shstrtab``      ``SHT_STRTAB``   *none*
1273     ``.strtab``        ``SHT_STRTAB``   *none*
1274     ``.symtab``        ``SHT_SYMTAB``   *none*
1275     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1276     ================== ================ =================================
1277
1278These sections have their standard meanings (see [ELF]_) and are only generated
1279if needed.
1280
``.debug_``\ *\**
1282  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1283  information on the DWARF produced by the AMDGPU backend.
1284
1285``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1286  The standard sections used by a dynamic loader.
1287
1288``.note``
1289  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1290  backend.
1291
1292``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.
1296
1297  For linked shared code objects, ``.rela.dyn`` contains all the relocation
1298  records from each of the relocatable code object's ``.rela``\ *name* sections.
1299
1300  See :ref:`amdgpu-relocation-records` for the relocation records supported by
1301  the AMDGPU backend.
1302
1303``.text``
1304  The executable machine code for the kernels and functions they call. Generated
1305  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.
1307
1308.. _amdgpu-note-records:
1309
1310Note Records
1311------------
1312
1313The AMDGPU backend code object contains ELF note records in the ``.note``
1314section. The set of generated notes and their semantics depend on the code
1315object version; see :ref:`amdgpu-note-records-v2` and
1316:ref:`amdgpu-note-records-v3-onwards`.
1317
1318As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1319must be generated after the ``name`` field to ensure the ``desc`` field is 4
1320byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
alignment.
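
The padding rules can be sketched as follows, assuming the generic ELF note
layout of three 32-bit words (``namesz``, ``descsz``, ``type``) followed by
the padded ``name`` and ``desc`` fields. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds n up to the next multiple of 4 (assumes n is well below UINT32_MAX). */
static uint32_t note_align4(uint32_t n) {
  return (n + 3u) & ~3u;
}

/* Offset of the desc field from the start of the note record: the 12-byte
   note header plus the name field padded to a multiple of 4 bytes. */
static size_t note_desc_offset(uint32_t namesz) {
  return 12u + (size_t)note_align4(namesz);
}

/* Total size of one note record, including the padding after desc. */
static size_t note_record_size(uint32_t namesz, uint32_t descsz) {
  return note_desc_offset(namesz) + (size_t)note_align4(descsz);
}
```

With a vendor name of "AMDGPU" (7 bytes including the NUL character) the
``name`` field is padded to 8 bytes, so ``desc`` starts 20 bytes into the
record.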
1324
1325.. _amdgpu-note-records-v2:
1326
1327Code Object V2 Note Records
1328~~~~~~~~~~~~~~~~~~~~~~~~~~~
1329
1330.. warning::
1331  Code object V2 is not the default code object version emitted by
1332  this version of LLVM.
1333
The AMDGPU backend code object uses the following ELF note records in the
1335``.note`` section when compiling for code object V2.
1336
1337The note record vendor field is "AMD".
1338
1339Additional note records may be present, but any which are not documented here
1340are deprecated and should not be used.
1341
1342  .. table:: AMDGPU Code Object V2 ELF Note Records
1343     :name: amdgpu-elf-note-records-v2-table
1344
1345     ===== ===================================== ======================================
1346     Name  Type                                  Description
1347     ===== ===================================== ======================================
1348     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1349     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1350                                                 Finalizer and not the LLVM compiler.
1351     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1352     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1353                                                 YAML [YAML]_ textual format.
1354     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1355     ===== ===================================== ======================================
1356
1357..
1358
1359  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1360     :name: amdgpu-elf-note-record-enumeration-values-v2-table
1361
1362     ===================================== =====
1363     Name                                  Value
1364     ===================================== =====
1365     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1366     ``NT_AMD_HSA_HSAIL``                  2
1367     ``NT_AMD_HSA_ISA_VERSION``            3
1368     *reserved*                            4-9
1369     ``NT_AMD_HSA_METADATA``               10
1370     ``NT_AMD_HSA_ISA_NAME``               11
1371     ===================================== =====
1372
1373``NT_AMD_HSA_CODE_OBJECT_VERSION``
1374  Specifies the code object version number. The description field has the
1375  following layout:
1376
1377  .. code:: c
1378
1379    struct amdgpu_hsa_note_code_object_version_s {
1380      uint32_t major_version;
1381      uint32_t minor_version;
1382    };
1383
1384  The ``major_version`` has a value less than or equal to 2.
1385
1386``NT_AMD_HSA_HSAIL``
1387  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1388  field has the following layout:
1389
1390  .. code:: c
1391
1392    struct amdgpu_hsa_note_hsail_s {
1393      uint32_t hsail_major_version;
1394      uint32_t hsail_minor_version;
1395      uint8_t profile;
1396      uint8_t machine_model;
1397      uint8_t default_float_round;
1398    };
1399
1400``NT_AMD_HSA_ISA_VERSION``
1401  Specifies the target ISA version. The description field has the following layout:
1402
1403  .. code:: c
1404
1405    struct amdgpu_hsa_note_isa_s {
1406      uint16_t vendor_name_size;
1407      uint16_t architecture_name_size;
1408      uint32_t major;
1409      uint32_t minor;
1410      uint32_t stepping;
1411      char vendor_and_architecture_name[1];
1412    };
1413
1414  ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1415  vendor and architecture names respectively, including the NUL character.
1416
  ``vendor_and_architecture_name`` contains the NUL terminated string for the
1418  vendor, immediately followed by the NUL terminated string for the
1419  architecture.
1420
1421  This note record is used by the HSA runtime loader.
1422
1423  Code object V2 only supports a limited number of processors and has fixed
1424  settings for target features. See
1425  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1426  processors and the corresponding target ID. In the table the note record ISA
1427  name is a concatenation of the vendor name, architecture name, major, minor,
1428  and stepping separated by a ":".
1429
1430  The target ID column shows the processor name and fixed target features used
1431  by the LLVM compiler. The LLVM compiler does not generate a
1432  ``NT_AMD_HSA_HSAIL`` note record.
1433
1434  A code object generated by the Finalizer also uses code object V2 and always
1435  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
1437  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1438  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1439  bit.
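
As an illustration of the ``NT_AMD_HSA_ISA_VERSION`` layout and of the
note record ISA name concatenation described above, the sketch below parses a
description field and formats the name. It assumes little-endian fields with
no padding between them, as implied by the struct; the helper names are
hypothetical and not part of the HSA runtime.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t rd16(const uint8_t *p) {
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

static uint32_t rd32(const uint8_t *p) {
  return rd16(p) | (rd16(p + 2) << 16);
}

/* Writes "vendor:architecture:major:minor:stepping" to out. Returns 0 on
   success, or -1 if desc is malformed or out is too small. */
static int amdgpu_isa_note_to_name(const uint8_t *desc, size_t desc_size,
                                   char *out, size_t out_size) {
  if (desc_size < 16)
    return -1;
  uint32_t vendor_size = rd16(desc);   /* includes the NUL character */
  uint32_t arch_size = rd16(desc + 2); /* includes the NUL character */
  if (vendor_size == 0 || arch_size == 0)
    return -1;
  if (desc_size - 16 < (size_t)vendor_size + arch_size)
    return -1;
  const char *vendor = (const char *)desc + 16;
  const char *arch = vendor + vendor_size;
  if (vendor[vendor_size - 1] != '\0' || arch[arch_size - 1] != '\0')
    return -1;
  int n = snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch,
                   (unsigned)rd32(desc + 4), (unsigned)rd32(desc + 8),
                   (unsigned)rd32(desc + 12));
  return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

The resulting string can then be looked up in
:ref:`amdgpu-elf-note-record-supported_processors-v2-table` to find the
corresponding target ID.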
1440
1441``NT_AMD_HSA_ISA_NAME``
1442  Specifies the target ISA name as a non-NUL terminated string.
1443
1444  This note record is not used by the HSA runtime loader.
1445
1446  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1447  V2's limited support of processors and fixed settings for target features.
1448
1449  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1450  from the string to the corresponding target ID. If the ``xnack`` target
1451  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not do the appending and
1453  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1454
1455``NT_AMD_HSA_METADATA``
1456  Specifies extensible metadata associated with the code objects executed on HSA
1457  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1458  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1459  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1460  metadata string.
1461
1462  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1463     :name: amdgpu-elf-note-record-supported_processors-v2-table
1464
1465     ===================== ==========================
1466     Note Record ISA Name  Target ID
1467     ===================== ==========================
1468     ``AMD:AMDGPU:6:0:0``  ``gfx600``
1469     ``AMD:AMDGPU:6:0:1``  ``gfx601``
1470     ``AMD:AMDGPU:6:0:2``  ``gfx602``
1471     ``AMD:AMDGPU:7:0:0``  ``gfx700``
1472     ``AMD:AMDGPU:7:0:1``  ``gfx701``
1473     ``AMD:AMDGPU:7:0:2``  ``gfx702``
1474     ``AMD:AMDGPU:7:0:3``  ``gfx703``
1475     ``AMD:AMDGPU:7:0:4``  ``gfx704``
1476     ``AMD:AMDGPU:7:0:5``  ``gfx705``
1477     ``AMD:AMDGPU:8:0:0``  ``gfx802``
1478     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1479     ``AMD:AMDGPU:8:0:2``  ``gfx802``
1480     ``AMD:AMDGPU:8:0:3``  ``gfx803``
1481     ``AMD:AMDGPU:8:0:4``  ``gfx803``
1482     ``AMD:AMDGPU:8:0:5``  ``gfx805``
1483     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1484     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1485     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1486     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1487     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1488     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1489     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1490     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1491     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1492     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1493     ===================== ==========================
1494
1495.. _amdgpu-note-records-v3-onwards:
1496
1497Code Object V3 and Above Note Records
1498~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1499
1500The AMDGPU backend code object uses the following ELF note record in the
1501``.note`` section when compiling for code object V3 and above.
1502
1503The note record vendor field is "AMDGPU".
1504
1505Additional note records may be present, but any which are not documented here
1506are deprecated and should not be used.
1507
1508  .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1509     :name: amdgpu-elf-note-records-table-v3-onwards
1510
1511     ======== ============================== ======================================
1512     Name     Type                           Description
1513     ======== ============================== ======================================
1514     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1515                                             binary format.
1516     ======== ============================== ======================================
1517
1518..
1519
1520  .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1521     :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1522
1523     ============================== =====
1524     Name                           Value
1525     ============================== =====
1526     *reserved*                     0-31
1527     ``NT_AMDGPU_METADATA``         32
1528     ============================== =====
1529
1530``NT_AMDGPU_METADATA``
1531  Specifies extensible metadata associated with an AMDGPU code object. It is
1532  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1533  :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1534  :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
1535  :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
1536  ``amdhsa`` OS.
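
Since the metadata is encoded as a Message Pack map, a consumer can perform a
cheap sanity check on the first byte of the description field before handing
it to a full decoder. The sketch below relies on the Message Pack format
markers for maps (``fixmap``, ``map 16`` and ``map 32``); the function name is
an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffer starts with a Message Pack map marker, else 0. */
static int msgpack_starts_with_map(const uint8_t *desc, size_t size) {
  if (size == 0)
    return 0;
  uint8_t b = desc[0];
  return (b & 0xf0u) == 0x80u /* fixmap: 0x80 to 0x8f */
         || b == 0xdeu        /* map 16 */
         || b == 0xdfu;       /* map 32 */
}
```

This only inspects the outermost type marker; it does not validate the keys
or the rest of the encoding.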
1537
1538.. _amdgpu-symbols:
1539
1540Symbols
1541-------
1542
1543Symbols include the following:
1544
1545  .. table:: AMDGPU ELF Symbols
1546     :name: amdgpu-elf-symbols-table
1547
1548     ===================== ================== ================ ==================
1549     Name                  Type               Section          Description
1550     ===================== ================== ================ ==================
1551     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1552                                              - ``.rodata``
1553                                              - ``.bss``
1554     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1555     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1556     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1557     ===================== ================== ================ ==================
1558
1559Global variable
1560  Global variables both used and defined by the compilation unit.
1561
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
1564
  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1566  will resolve relocations using the definition provided by another code object
1567  or explicitly defined by the runtime.
1568
1569  If the symbol resides in local/group memory (LDS) then its section is the
1570  special processor specific section name ``SHN_AMDGPU_LDS``, and the
1571  ``st_value`` field describes alignment requirements as it does for common
1572  symbols.
1573
1574  .. TODO::
1575
1576     Add description of linked shared object symbols. Seems undefined symbols
1577     are marked as STT_NOTYPE.
1578
1579Kernel descriptor
1580  Every HSA kernel has an associated kernel descriptor. It is the address of the
1581  kernel descriptor that is used in the AQL dispatch packet used to invoke the
1582  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1583  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1584
1585Kernel entry point
1586  Every HSA kernel also has a symbol for its machine code entry point.
1587
1588.. _amdgpu-relocation-records:
1589
1590Relocation Records
1591------------------
1592
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1594relocatable fields are:
1595
1596``word32``
1597  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1598  alignment. These values use the same byte order as other word values in the
1599  AMDGPU architecture.
1600
1601``word64``
1602  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1603  alignment. These values use the same byte order as other word values in the
1604  AMDGPU architecture.
1605
The following notations are used for specifying relocation calculations:
1607
1608**A**
1609  Represents the addend used to compute the value of the relocatable field.
1610
1611**G**
1612  Represents the offset into the global offset table at which the relocation
1613  entry's symbol will reside during execution.
1614
1615**GOT**
1616  Represents the address of the global offset table.
1617
1618**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1620  of the storage unit being relocated (computed using ``r_offset``).
1621
1622**S**
1623  Represents the value of the symbol whose index resides in the relocation
1624  entry. Relocations not using this must specify a symbol index of
1625  ``STN_UNDEF``.
1626
1627**B**
1628  Represents the base address of a loaded executable or shared object which is
1629  the difference between the ELF address and the actual load address.
1630  Relocations using this are only valid in executable or shared objects.
1631
1632The following relocation types are supported:
1633
1634  .. table:: AMDGPU ELF Relocation Records
1635     :name: amdgpu-elf-relocation-records-table
1636
1637     ========================== ======= =====  ==========  ==============================
1638     Relocation Type            Kind    Value  Field       Calculation
1639     ========================== ======= =====  ==========  ==============================
1640     ``R_AMDGPU_NONE``                  0      *none*      *none*
1641     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1642                                Dynamic
1643     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1644                                Dynamic
1645     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1646                                Dynamic
1647     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1648     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1649     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1650                                Dynamic
1651     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1652     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1653     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1654     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1655     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1656     *reserved*                         12
1657     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1658     ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1659     ========================== ======= =====  ==========  ==============================
1660
1661``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1662the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1663
1664There is no current OS loader support for 32-bit programs and so
1665``R_AMDGPU_ABS32`` is not used.
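
As an illustration only, the calculations in the relocation table above can be
modeled in Python. This is a minimal sketch, not the linker's or loader's
implementation: ``reloc_value`` and its keyword arguments are hypothetical names
that mirror the notation above, intermediate values are assumed to wrap to 64
bits before any ``>> 32`` shift, 32-bit fields are assumed to hold the truncated
result, and the ``R_AMDGPU_REL16`` distance is assumed to be a multiple of 4.

.. code:: python

  MASK16 = 0xFFFF
  MASK32 = 0xFFFFFFFF
  MASK64 = 0xFFFFFFFFFFFFFFFF


  def reloc_value(kind, S=0, A=0, P=0, G=0, GOT=0, B=0):
      """Return the unsigned bit pattern stored in the relocated field.

      Hypothetical helper: argument names follow the notation in the text.
      """
      abs_val = (S + A) & MASK64            # assumption: wrap to 64 bits
      rel_val = (S + A - P) & MASK64        # PC-relative distance
      got_val = (G + GOT + A - P) & MASK64  # PC-relative GOT slot distance
      table = {
          "R_AMDGPU_ABS32_LO": abs_val & MASK32,
          "R_AMDGPU_ABS32_HI": abs_val >> 32,
          "R_AMDGPU_ABS64": abs_val,
          "R_AMDGPU_REL32": rel_val & MASK32,
          "R_AMDGPU_REL64": rel_val,
          "R_AMDGPU_ABS32": abs_val & MASK32,
          "R_AMDGPU_GOTPCREL": got_val & MASK32,
          "R_AMDGPU_GOTPCREL32_LO": got_val & MASK32,
          "R_AMDGPU_GOTPCREL32_HI": got_val >> 32,
          "R_AMDGPU_REL32_LO": rel_val & MASK32,
          "R_AMDGPU_REL32_HI": rel_val >> 32,
          "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
          # Assumption: (S + A - P) - 4 is a multiple of 4.
          "R_AMDGPU_REL16": (((S + A - P) - 4) // 4) & MASK16,
      }
      return table[kind]

For example, with ``S=0x1000``, ``A=0`` and ``P=0x1010`` the PC-relative
distance is -16, so this model stores the two's complement pattern of -16 for
``R_AMDGPU_REL32`` and of -5 for ``R_AMDGPU_REL16``.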
1666
1667.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1668
1669Loaded Code Object Path Uniform Resource Identifier (URI)
1670---------------------------------------------------------
1671
1672The AMD GPU code object loader represents the path of the ELF shared object from
1673which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded, relocated form of the ELF
1675shared object.  Multiple code objects may be loaded at different memory
1676addresses in the same process from the same ELF shared object.
1677
1678The loaded code object path URI syntax is defined by the following BNF syntax:
1679
1680.. code::
1681
  code_object_uri ::= file_uri | memory_uri
  file_uri        ::= "file://" file_path [ range_specifier ]
  memory_uri      ::= "memory://" process_id range_specifier
  range_specifier ::= [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::= URI_ENCODED_OS_FILE_PATH
  process_id      ::= DECIMAL_NUMBER
  number          ::= HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1689
1690**number**
1691  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1692  and octal values by "0".
1693
1694**file_path**
1695  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1696  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
1698  the path are separated by "/".
1699
1700**offset**
1701  Is a 0-based byte offset to the start of the code object.  For a file URI, it
1702  is from the start of the file specified by the ``file_path``, and if omitted
1703  defaults to 0. For a memory URI, it is the memory address and is required.
1704
1705**size**
1706  Is the number of bytes in the code object.  For a file URI, if omitted it
1707  defaults to the size of the file.  It is required for a memory URI.
1708
1709**process_id**
1710  Is the identity of the process owning the memory.  For Linux it is the C
1711  unsigned integral decimal literal for the process ID (PID).
1712
1713For example:
1714
1715.. code::
1716
1717  file:///dir1/dir2/file1
1718  file:///dir3/dir4/file2#offset=0x2000&size=3000
1719  memory://1234#offset=0x20000&size=3000
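
A consumer of these URIs could split them into their components along the
following lines. This is only a sketch under stated assumptions:
``parse_number`` and ``parse_code_object_uri`` are hypothetical helper names,
the returned dictionary layout is invented for illustration, and a range
specifier is assumed to always be introduced by ``#`` or ``?`` as in the
examples above.

.. code:: python

  import re
  from urllib.parse import unquote


  def parse_number(text):
      """Parse a C-style literal: 0x/0X hex, leading-0 octal, else decimal."""
      if text[:2] in ("0x", "0X"):
          return int(text[2:], 16)
      if len(text) > 1 and text[0] == "0":
          return int(text[1:], 8)
      return int(text, 10)


  def parse_code_object_uri(uri):
      """Split a loaded code object URI into a dict (hypothetical layout)."""
      for scheme in ("file", "memory"):
          prefix = scheme + "://"
          if uri.startswith(prefix):
              rest = uri[len(prefix):]
              break
      else:
          raise ValueError("unsupported scheme: " + uri)

      # The range specifier, if present, starts at the first "#" or "?".
      sep = re.search(r"[#?]", rest)
      body = rest[:sep.start()] if sep else rest
      spec = rest[sep.end():] if sep else None

      offset = size = None
      if spec is not None:
          m = re.fullmatch(r"offset=(\w+)&size=(\w+)", spec)
          if m is None:
              raise ValueError("malformed range specifier: " + spec)
          offset, size = parse_number(m.group(1)), parse_number(m.group(2))

      if scheme == "memory":
          if spec is None:
              raise ValueError("a memory URI requires offset and size")
          return {"kind": "memory", "process_id": int(body, 10),
                  "offset": offset, "size": size}

      # File URI: offset defaults to 0; size None means "size of the file".
      return {"kind": "file", "path": unquote(body),
              "offset": offset or 0, "size": size}

Applied to the second example above, this sketch yields the path
``/dir3/dir4/file2`` with a byte offset of 8192 and a size of 3000.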
1720
1721.. _amdgpu-dwarf-debug-information:
1722
1723DWARF Debug Information
1724=======================
1725
1726.. warning::
1727
1728   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1729   is not currently fully implemented and is subject to change.
1730
1731AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1732:ref:`amdgpu-elf-code-object`) which contain information that maps the code
1733object executable code and data to the source language constructs. It can be
1734used by tools such as debuggers and profilers. It uses features defined in
1735:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1736DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1737
1738This section defines the AMDGPU target architecture specific DWARF mappings.
1739
1740.. _amdgpu-dwarf-register-identifier:
1741
1742Register Identifier
1743-------------------
1744
1745This section defines the AMDGPU target architecture register numbers used in
1746DWARF operation expressions (see DWARF Version 5 section 2.5 and
1747:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1748instructions (see DWARF Version 5 section 6.4 and
1749:ref:`amdgpu-dwarf-call-frame-information`).
1750
1751A single code object can contain code for kernels that have different wavefront
1752sizes. The vector registers and some scalar registers are based on the wavefront
1753size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1754simplifies the consumer of the DWARF so that each register has a fixed size,
1755rather than being dynamic according to the wavefront size mode. Similarly,
1756distinct DWARF registers are defined for those registers that vary in size
1757according to the process address size. This allows a consumer to treat a
1758specific AMDGPU processor as a single architecture regardless of how it is
1759configured at run time. The compiler explicitly specifies the DWARF registers
1760that match the mode in which the code it is generating will be executed.
1761
1762DWARF registers are encoded as numbers, which are mapped to architecture
1763registers. The mapping for AMDGPU is defined in
1764:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1765mapping.
1766
1767.. table:: AMDGPU DWARF Register Mapping
1768   :name: amdgpu-dwarf-register-mapping-table
1769
1770   ============== ================= ======== ==================================
1771   DWARF Register AMDGPU Register   Bit Size Description
1772   ============== ================= ======== ==================================
1773   0              PC_32             32       Program Counter (PC) when
1774                                             executing in a 32-bit process
1775                                             address space. Used in the CFI to
1776                                             describe the PC of the calling
1777                                             frame.
1778   1              EXEC_MASK_32      32       Execution Mask Register when
1779                                             executing in wavefront 32 mode.
1780   2-15           *Reserved*                 *Reserved for highly accessed
1781                                             registers using DWARF shortcut.*
1782   16             PC_64             64       Program Counter (PC) when
1783                                             executing in a 64-bit process
1784                                             address space. Used in the CFI to
1785                                             describe the PC of the calling
1786                                             frame.
1787   17             EXEC_MASK_64      64       Execution Mask Register when
1788                                             executing in wavefront 64 mode.
1789   18-31          *Reserved*                 *Reserved for highly accessed
1790                                             registers using DWARF shortcut.*
1791   32-95          SGPR0-SGPR63      32       Scalar General Purpose
1792                                             Registers.
1793   96-127         *Reserved*                 *Reserved for frequently accessed
1794                                             registers using DWARF 1-byte ULEB.*
1795   128            STATUS            32       Status Register.
1796   129-511        *Reserved*                 *Reserved for future Scalar
1797                                             Architectural Registers.*
1798   512            VCC_32            32       Vector Condition Code Register
1799                                             when executing in wavefront 32
1800                                             mode.
1801   513-767        *Reserved*                 *Reserved for future Vector
1802                                             Architectural Registers when
1803                                             executing in wavefront 32 mode.*
1804   768            VCC_64            64       Vector Condition Code Register
1805                                             when executing in wavefront 64
1806                                             mode.
1807   769-1023       *Reserved*                 *Reserved for future Vector
1808                                             Architectural Registers when
1809                                             executing in wavefront 64 mode.*
1810   1024-1087      *Reserved*                 *Reserved for padding.*
1811   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1812   1130-1535      *Reserved*                 *Reserved for future Scalar
1813                                             General Purpose Registers.*
1814   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1815                                             when executing in wavefront 32
1816                                             mode.
1817   1792-2047      *Reserved*                 *Reserved for future Vector
1818                                             General Purpose Registers when
1819                                             executing in wavefront 32 mode.*
1820   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1821                                             when executing in wavefront 32
1822                                             mode.
1823   2304-2559      *Reserved*                 *Reserved for future Vector
1824                                             Accumulation Registers when
1825                                             executing in wavefront 32 mode.*
1826   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1827                                             when executing in wavefront 64
1828                                             mode.
1829   2816-3071      *Reserved*                 *Reserved for future Vector
1830                                             General Purpose Registers when
1831                                             executing in wavefront 64 mode.*
1832   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1833                                             when executing in wavefront 64
1834                                             mode.
1835   3328-3583      *Reserved*                 *Reserved for future Vector
1836                                             Accumulation Registers when
1837                                             executing in wavefront 64 mode.*
1838   ============== ================= ======== ==================================
1839
1840The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
1842the least significant bit position corresponding to lane 0 and so forth. DWARF
1843location expressions involving the ``DW_OP_LLVM_offset`` and
1844``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1845register corresponding to the lane that is executing the current thread of
1846execution in languages that are implemented using a SIMD or SIMT execution
1847model.
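
The layout described above can be pictured with a short Python sketch. It is
illustrative only: ``lane_dword`` is a hypothetical helper, and the register
contents are assumed to be available as a little-endian byte string, consistent
with the byte ordering that this document states for AMDGPU storage.

.. code:: python

  def lane_dword(register_bytes, lane):
      """Return the 32-bit value held for `lane` in a vector register image."""
      start = lane * 4  # one dword (4 bytes) per lane, lane 0 first
      chunk = register_bytes[start:start + 4]
      if lane < 0 or len(chunk) != 4:
          raise IndexError("lane out of range for this register image")
      return int.from_bytes(chunk, "little")


  # A wavefront 32 register image in which lane i holds the value i * 3.
  wave32_image = b"".join((i * 3).to_bytes(4, "little") for i in range(32))

Viewing the same image as a single wide integer, lane ``i`` occupies bits
``32 * i`` to ``32 * i + 31``, so lane 0 sits at the least significant bit
position.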
1848
1849If the wavefront size is 32 lanes then the wavefront 32 mode register
1850definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1851mode register definitions are used. Some AMDGPU targets support executing in
1852both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1853to the wavefront mode of the generated code will be used.
1854
1855If code is generated to execute in a 32-bit process address space, then the
185632-bit process address space register definitions are used. If code is generated
1857to execute in a 64-bit process address space, then the 64-bit process address
1858space register definitions are used. The ``amdgcn`` target only supports the
185964-bit process address space.
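
The numbering in :ref:`amdgpu-dwarf-register-mapping-table` can be expressed as
a small lookup. The following Python is a sketch that only restates the table:
the function and dictionary names are hypothetical and are not part of LLVM.

.. code:: python

  # Fixed registers from the mapping table (name -> DWARF register number).
  SPECIAL_REGS = {
      "PC_32": 0, "EXEC_MASK_32": 1,
      "PC_64": 16, "EXEC_MASK_64": 17,
      "STATUS": 128, "VCC_32": 512, "VCC_64": 768,
  }

  # First DWARF number of each vector register bank, by (kind, wavefront size).
  VECTOR_BASE = {
      ("vgpr", 32): 1536, ("agpr", 32): 2048,
      ("vgpr", 64): 2560, ("agpr", 64): 3072,
  }


  def dwarf_sgpr(n):
      """DWARF register number for SGPRn (split into two ranges by the table)."""
      if 0 <= n <= 63:
          return 32 + n
      if 64 <= n <= 105:
          return 1088 + (n - 64)
      raise ValueError("SGPR index out of range: %d" % n)


  def dwarf_vector(kind, n, wavefront_size):
      """DWARF register number for VGPRn or AGPRn in the given wavefront mode."""
      if not 0 <= n <= 255:
          raise ValueError("vector register index out of range: %d" % n)
      return VECTOR_BASE[(kind, wavefront_size)] + n

Under this sketch, code generated for wavefront 64 mode would describe ``VGPR3``
as DWARF register 2563, while the same register in wavefront 32 mode is 1539.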
1860
1861.. _amdgpu-dwarf-address-class-identifier:
1862
1863Address Class Identifier
1864------------------------
1865
1866The DWARF address class represents the source language memory space. See DWARF
1867Version 5 section 2.12 which is updated by the *DWARF Extensions For
1868Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1869
1870The DWARF address class mapping used for AMDGPU is defined in
1871:ref:`amdgpu-dwarf-address-class-mapping-table`.
1872
1873.. table:: AMDGPU DWARF Address Class Mapping
1874   :name: amdgpu-dwarf-address-class-mapping-table
1875
1876   ========================= ====== =================
1877   DWARF                            AMDGPU
1878   -------------------------------- -----------------
1879   Address Class Name        Value  Address Space
1880   ========================= ====== =================
1881   ``DW_ADDR_none``          0x0000 Generic (Flat)
1882   ``DW_ADDR_LLVM_global``   0x0001 Global
1883   ``DW_ADDR_LLVM_constant`` 0x0002 Global
1884   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
1885   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
1886   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
1887   ========================= ====== =================
1888
1889The DWARF address class values defined in the *DWARF Extensions For
1890Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
1891
In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
available to the AMD extension for accessing the hardware GDS memory, which is
scratchpad memory allocated per device.
1895
For AMDGPU, if no ``DW_AT_address_class`` attribute is present, then the default
1897address class of ``DW_ADDR_none`` is used.
1898
1899See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1900mapping of DWARF address classes to DWARF address spaces, including address size
1901and NULL value.
1902
1903.. _amdgpu-dwarf-address-space-identifier:
1904
1905Address Space Identifier
1906------------------------
1907
DWARF address spaces correspond to target architecture specific linearly
1909addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1910For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1911
1912The DWARF address space mapping used for AMDGPU is defined in
1913:ref:`amdgpu-dwarf-address-space-mapping-table`.
1914
1915.. table:: AMDGPU DWARF Address Space Mapping
1916   :name: amdgpu-dwarf-address-space-mapping-table
1917
1918   ======================================= ===== ======= ======== ================= =======================
1919   DWARF                                                          AMDGPU            Notes
1920   --------------------------------------- ----- ---------------- ----------------- -----------------------
1921   Address Space Name                      Value Address Bit Size Address Space
1922   --------------------------------------- ----- ------- -------- ----------------- -----------------------
1923   ..                                            64-bit  32-bit
1924                                                 process process
1925                                                 address address
1926                                                 space   space
1927   ======================================= ===== ======= ======== ================= =======================
1928   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
1929   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
1930   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
1931   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
1932   *Reserved*                              0x04
1933   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
1934   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
1935   ======================================= ===== ======= ======== ================= =======================
1936
1937See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1938including address size and NULL value.
1939
1940The ``DW_ASPACE_none`` address space is the default target architecture address
1941space used in DWARF operations that do not specify an address space. It
1942therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1943related operations can refer to addresses in the program code.
1944
1945The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1946specify the flat address space. If the address corresponds to an address in the
1947local address space, then it corresponds to the wavefront that is executing the
1948focused thread of execution. If the address corresponds to an address in the
1949private address space, then it corresponds to the lane that is executing the
1950focused thread of execution for languages that are implemented using a SIMD or
1951SIMT execution model.
1952
1953.. note::
1954
1955  CUDA-like languages such as HIP that do not have address spaces in the
1956  language type system, but do allow variables to be allocated in different
1957  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1958  address space in the DWARF expression operations as the default address space
1959  is the global address space.
1960
1961The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
1962specify the local address space corresponding to the wavefront that is executing
1963the focused thread of execution.
1964
1965The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
1966to specify the private address space corresponding to the lane that is executing
1967the focused thread of execution for languages that are implemented using a SIMD
1968or SIMT execution model.
1969
1970The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
1971to specify the unswizzled private address space corresponding to the wavefront
1972that is executing the focused thread of execution. The wavefront view of private
1973memory is the per wavefront unswizzled backing memory layout defined in
1974:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
1975location for the backing memory of the wavefront (namely the address is not
1976offset by ``wavefront-scratch-base``). The following formula can be used to
1977convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
1978``DW_ASPACE_AMDGPU_private_wave`` address:
1979
1980::
1981
1982  private-address-wavefront =
1983    ((private-address-lane / 4) * wavefront-size * 4) +
1984    (wavefront-lane-id * 4) + (private-address-lane % 4)
1985
1986If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
1987of the dwords for each lane starting with lane 0 is required, then this
1988simplifies to:
1989
1990::
1991
1992  private-address-wavefront =
1993    private-address-lane * wavefront-size
1994
1995A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
1996complete spilled vector register back into a complete vector register in the
1997CFI. The frame pointer can be a private lane address which is dword aligned,
1998which can be shifted to multiply by the wavefront size, and then used to form a
1999private wavefront address that gives a location for a contiguous set of dwords,
2000one per lane, where the vector register dwords are spilled. The compiler knows
2001the wavefront size since it generates the code. Note that the type of the
2002address may have to be converted as the size of a
2003``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2004``DW_ASPACE_AMDGPU_private_wave`` address.
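
The conversion formula above can be checked with a few lines of Python. This is
a sketch only: ``private_lane_to_wave`` is a hypothetical name, and all
addresses are treated as plain non-negative integers.

.. code:: python

  def private_lane_to_wave(lane_address, lane_id, wavefront_size):
      """Convert a private lane address to a private wavefront address."""
      dword_index, byte_in_dword = divmod(lane_address, 4)
      return (dword_index * wavefront_size * 4   # start of the swizzled dword row
              + lane_id * 4                      # dword belonging to this lane
              + byte_in_dword)                   # byte within that dword


  # Dword-aligned address, lane 0: the result is lane_address * wavefront_size.
  frame_pointer = 8
  spill_base = private_lane_to_wave(frame_pointer, 0, 64)

With a wavefront size of 64, the dword-aligned lane address 8 maps to wavefront
address 512, which matches the simplified formula.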
2005
2006.. _amdgpu-dwarf-lane-identifier:
2007
Lane Identifier
2009---------------
2010
DWARF lane identifiers specify a target architecture lane position for hardware
2012that executes in a SIMD or SIMT manner, and on which a source language maps its
2013threads of execution onto those lanes. The DWARF lane identifier is pushed by
2014the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2015section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2016section :ref:`amdgpu-dwarf-operation-expressions`.
2017
2018For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2019wavefront. It is numbered from 0 to the wavefront size minus 1.
2020
2021Operation Expressions
2022---------------------
2023
2024DWARF expressions are used to compute program values and the locations of
2025program objects. See DWARF Version 5 section 2.5 and
2026:ref:`amdgpu-dwarf-operation-expressions`.
2027
2028DWARF location descriptions describe how to access storage which includes memory
2029and registers. When accessing storage on AMDGPU, bytes are ordered with least
2030significant bytes first, and bits are ordered within bytes with least
2031significant bits first.
2032
2033For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2034unwinding vector registers that are spilled under the execution mask to memory:
2035the zero-single location description is the vector register, and the one-single
2036location description is the spilled memory location description. The
2037``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2038memory location description.
2039
2040In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2041``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2042controlled by the execution mask. An undefined location description together
2043with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2044to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2045
2046Debugger Information Entry Attributes
2047-------------------------------------
2048
2049This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1
2051which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2052:ref:`amdgpu-dwarf-low-level-information` and
2053:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2054
2055.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2056
2057``DW_AT_LLVM_lane_pc``
2058~~~~~~~~~~~~~~~~~~~~~~
2059
2060For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2061location of the separate lanes of a SIMT thread.
2062
2063If the lane is an active lane then this will be the same as the current program
2064location.
2065
2066If the lane is inactive, but was active on entry to the subprogram, then this is
2067the program location in the subprogram at which execution of the lane is
conceptually positioned.
2069
2070If the lane was not active on entry to the subprogram, then this will be the
2071undefined location. A client debugger can check if the lane is part of a valid
2072work-group by checking that the lane is in the range of the associated
2073work-group within the grid, accounting for partial work-groups. If it is not,
2074then the debugger can omit any information for the lane. Otherwise, the debugger
2075may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2076calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
2078``DW_AT_LLVM_lane_pc``.
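
The unwinding strategy described above might look as follows in a debugger.
This Python sketch makes several assumptions: each call frame is represented
only by its evaluated ``DW_AT_LLVM_lane_pc`` vector, frames are listed innermost
first, ``None`` stands for the undefined location, and the work-group range
check is presumed to have been done already. ``lane_location`` is a hypothetical
name.

.. code:: python

  def lane_location(frames, lane):
      """Find the innermost frame in which `lane` has a defined location.

      frames: per-frame lists of lane PCs, innermost frame first.
      Returns (frame_depth, pc), or None if the lane has no call frames.
      """
      for depth, lane_pcs in enumerate(frames):
          pc = lane_pcs[lane]
          if pc is not None:
              return depth, pc
      return None


  # Two lanes: lane 1 was not active on entry to the innermost subprogram.
  example_frames = [[0x100, None], [0x200, 0x240]]

Here lane 0 is located in the innermost frame, while lane 1 is reported at its
program location in the calling subprogram.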
2079
2080The following example illustrates how the AMDGPU backend can generate a DWARF
2081location list expression for the nested ``IF/THEN/ELSE`` structures of the
2082following subprogram pseudo code for a target with 64 lanes per wavefront.
2083
2084.. code::
2085  :number-lines:
2086
2087  SUBPROGRAM X
2088  BEGIN
2089    a;
2090    IF (c1) THEN
2091      b;
2092      IF (c2) THEN
2093        c;
2094      ELSE
2095        d;
2096      ENDIF
2097      e;
2098    ELSE
2099      f;
2100    ENDIF
2101    g;
2102  END
2103
2104The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2105execution mask (``EXEC``) to linearize the control flow. The condition is
2106evaluated to make a mask of the lanes for which the condition evaluates to true.
2107First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2108logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated current ``EXEC`` mask with the ``EXEC`` mask saved at the start
of the region. After the ``IF/THEN/ELSE``
2111region the ``EXEC`` mask is restored to the value it had at the beginning of the
2112region. This is shown below. Other approaches are possible, but the basic
2113concept is the same.
2114
2115.. code::
2116  :number-lines:
2117
2118  $lex_start:
2119    a;
2120    %1 = EXEC
2121    %2 = c1
2122  $lex_1_start:
2123    EXEC = %1 & %2
  $lex_1_then:
2125      b;
2126      %3 = EXEC
2127      %4 = c2
2128  $lex_1_1_start:
2129      EXEC = %3 & %4
2130  $lex_1_1_then:
2131        c;
2132      EXEC = ~EXEC & %3
2133  $lex_1_1_else:
2134        d;
2135      EXEC = %3
2136  $lex_1_1_end:
2137      e;
2138    EXEC = ~EXEC & %1
2139  $lex_1_else:
2140      f;
2141    EXEC = %1
2142  $lex_1_end:
2143    g;
2144  $lex_end:
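
To see which lanes execute each statement under this scheme, the mask updates
can be replayed in Python. This is a minimal sketch and not backend code:
``run_subprogram_x`` is a hypothetical function, and the execution mask and
condition masks are modeled as non-negative integers with one bit per lane.

.. code:: python

  def run_subprogram_x(entry_exec, c1, c2):
      """Return {statement: mask of lanes that execute it} for SUBPROGRAM X."""
      trace = {}
      exec_mask = entry_exec
      trace["a"] = exec_mask
      saved_1 = exec_mask                 # %1 = EXEC
      exec_mask = saved_1 & c1            # EXEC = %1 & %2
      trace["b"] = exec_mask
      saved_3 = exec_mask                 # %3 = EXEC
      exec_mask = saved_3 & c2            # EXEC = %3 & %4
      trace["c"] = exec_mask
      exec_mask = ~exec_mask & saved_3    # EXEC = ~EXEC & %3
      trace["d"] = exec_mask
      exec_mask = saved_3                 # EXEC = %3
      trace["e"] = exec_mask
      exec_mask = ~exec_mask & saved_1    # EXEC = ~EXEC & %1
      trace["f"] = exec_mask
      exec_mask = saved_1                 # EXEC = %1
      trace["g"] = exec_mask
      return trace


  # Four active lanes; c1 holds for lanes 0-1 and c2 holds for lanes 0 and 2.
  example_trace = run_subprogram_x(0b1111, c1=0b0011, c2=0b0101)

In this example lane 0 executes ``a``, ``b``, ``c``, ``e`` and ``g``, lane 1
executes ``d`` instead of ``c``, and lanes 2 and 3 execute only ``a``, ``f`` and
``g``.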
2145
2146To create the DWARF location list expression that defines the location
2147description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2148pseudo instruction can be used to annotate the linearized control flow. This can
2149be done by defining an artificial variable for the lane PC. The DWARF location
2150list expression created for it is used as the value of the
2151``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2152
2153A DWARF procedure is defined for each well nested structured control flow region
2154which provides the conceptual lane program location for a lane if it is not
2155active (namely it is divergent). The DWARF operation expression for each region
2156conceptually inherits the value of the immediately enclosing region and modifies
2157it according to the semantics of the region.
2158
2159For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2160the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2161region the divergent program location is at the end of the ``IF/THEN/ELSE``
2162region since the ``THEN`` region has completed.
2163
2164The lane PC artificial variable is assigned at each region transition. It uses
2165the immediately enclosing region's DWARF procedure to compute the program
2166location for each lane assuming they are divergent, and then modifies the result
2167by inserting the current program location for each lane that the ``EXEC`` mask
2168indicates is active.
2169
2170By having separate DWARF procedures for each region, they can be reused to
2171define the value for any nested region. This reduces the total size of the DWARF
2172operation expressions.
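
The way a region's procedure builds on its enclosing region can be modeled
loosely in Python. This sketch is not an implementation of the DWARF
operations: ``select_lanes`` and ``lane_pcs`` are hypothetical helpers, ``None``
stands for the undefined location, and the per-lane selection by mask bit is
assumed to behave like the ``DW_OP_LLVM_select_bit_piece`` usage described in
this section.

.. code:: python

  def select_lanes(when_clear, when_set, mask):
      """Per lane, take when_set[i] if bit i of mask is set, else when_clear[i]."""
      return [s if (mask >> i) & 1 else c
              for i, (c, s) in enumerate(zip(when_clear, when_set))]


  def lane_pcs(divergent_pcs, current_pc, exec_mask):
      """Overlay the current PC onto the lanes that EXEC marks as active."""
      return select_lanes(divergent_pcs, [current_pc] * len(divergent_pcs),
                          exec_mask)


  LANES = 4
  LEX_1_START, CURRENT_PC = 0x40, 0x48   # made-up code addresses

  # Outermost region: every inactive lane is undefined (not active on entry).
  outer_divergent = [None] * LANES
  # THEN region: lanes saved in EXEC on entry are parked at the region start.
  then_divergent = select_lanes(outer_divergent, [LEX_1_START] * LANES, 0b0111)
  # Inside THEN with EXEC = 0b0011: the active lanes report the current PC.
  then_lane_pcs = lane_pcs(then_divergent, CURRENT_PC, 0b0011)

In this model lanes 0 and 1 report the current program location, lane 2 is
conceptually positioned at the start of the region, and lane 3, which was not
active on entry, stays undefined.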
2173
2174The following provides an example using pseudo LLVM MIR.
2175
2176.. code::
2177  :number-lines:
2178
2179  $lex_start:
2180    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2181      DW_AT_name = "__uint64";
2182      DW_AT_byte_size = 8;
2183      DW_AT_encoding = DW_ATE_unsigned;
2184    ];
2185    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2186      DW_AT_name = "__active_lane_pc";
2187      DW_AT_location = [
2188        DW_OP_regx PC;
2189        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
2191        DW_OP_LLVM_select_bit_piece 64, 64;
2192      ];
2193    ];
2194    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2195      DW_AT_name = "__divergent_lane_pc";
2196      DW_AT_location = [
2197        DW_OP_LLVM_undefined;
2198        DW_OP_LLVM_extend 64, 64;
2199      ];
2200    ];
2201    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2202      DW_OP_call_ref %__divergent_lane_pc;
2203      DW_OP_call_ref %__active_lane_pc;
2204    ];
2205    a;
2206    %1 = EXEC;
2207    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2208    %2 = c1;
2209  $lex_1_start:
2210    EXEC = %1 & %2;
2211  $lex_1_then:
2212      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2213        DW_AT_name = "__divergent_lane_pc_1_then";
2214        DW_AT_location = DIExpression[
2215          DW_OP_call_ref %__divergent_lane_pc;
2216          DW_OP_addrx &lex_1_start;
2217          DW_OP_stack_value;
2218          DW_OP_LLVM_extend 64, 64;
2219          DW_OP_call_ref %__lex_1_save_exec;
2220          DW_OP_deref_type 64, %__uint_64;
2221          DW_OP_LLVM_select_bit_piece 64, 64;
2222        ];
2223      ];
2224      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2225        DW_OP_call_ref %__divergent_lane_pc_1_then;
2226        DW_OP_call_ref %__active_lane_pc;
2227      ];
2228      b;
2229      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2231      %4 = c2;
2232  $lex_1_1_start:
2233      EXEC = %3 & %4;
2234  $lex_1_1_then:
2235        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2236          DW_AT_name = "__divergent_lane_pc_1_1_then";
2237          DW_AT_location = DIExpression[
2238            DW_OP_call_ref %__divergent_lane_pc_1_then;
2239            DW_OP_addrx &lex_1_1_start;
2240            DW_OP_stack_value;
2241            DW_OP_LLVM_extend 64, 64;
2242            DW_OP_call_ref %__lex_1_1_save_exec;
2243            DW_OP_deref_type 64, %__uint_64;
2244            DW_OP_LLVM_select_bit_piece 64, 64;
2245          ];
2246        ];
2247        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2248          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2249          DW_OP_call_ref %__active_lane_pc;
2250        ];
2251        c;
2252      EXEC = ~EXEC & %3;
2253  $lex_1_1_else:
2254        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2255          DW_AT_name = "__divergent_lane_pc_1_1_else";
2256          DW_AT_location = DIExpression[
2257            DW_OP_call_ref %__divergent_lane_pc_1_then;
2258            DW_OP_addrx &lex_1_1_end;
2259            DW_OP_stack_value;
2260            DW_OP_LLVM_extend 64, 64;
2261            DW_OP_call_ref %__lex_1_1_save_exec;
2262            DW_OP_deref_type 64, %__uint_64;
2263            DW_OP_LLVM_select_bit_piece 64, 64;
2264          ];
2265        ];
2266        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2267          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2268          DW_OP_call_ref %__active_lane_pc;
2269        ];
2270        d;
2271      EXEC = %3;
2272  $lex_1_1_end:
2273      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
2275        DW_OP_call_ref %__active_lane_pc;
2276      ];
2277      e;
2278    EXEC = ~EXEC & %1;
2279  $lex_1_else:
2280      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2281        DW_AT_name = "__divergent_lane_pc_1_else";
2282        DW_AT_location = DIExpression[
2283          DW_OP_call_ref %__divergent_lane_pc;
2284          DW_OP_addrx &lex_1_end;
2285          DW_OP_stack_value;
2286          DW_OP_LLVM_extend 64, 64;
2287          DW_OP_call_ref %__lex_1_save_exec;
2288          DW_OP_deref_type 64, %__uint_64;
2289          DW_OP_LLVM_select_bit_piece 64, 64;
2290        ];
2291      ];
2292      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2293        DW_OP_call_ref %__divergent_lane_pc_1_else;
2294        DW_OP_call_ref %__active_lane_pc;
2295      ];
2296      f;
2297    EXEC = %1;
2298  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2300      DW_OP_call_ref %__divergent_lane_pc;
2301      DW_OP_call_ref %__active_lane_pc;
2302    ];
2303    g;
2304  $lex_end:
2305
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.
2308
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
2314
2315The DWARF procedures for each region use the values of the saved execution mask
2316artificial variables to only update the lanes that are active on entry to the
2317region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
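
The way the region procedures compose can be modeled outside of DWARF. The
following Python sketch is illustrative only and is not part of LLVM: the
helper names, the 4-lane wavefront, and the addresses are hypothetical, and
``None`` stands in for the undefined location description. It assumes that a
set mask bit selects the most recently pushed value, which mirrors how
``DW_OP_LLVM_select_bit_piece`` is used in the example above.

.. code:: python

  LANES = 4  # Hypothetical small wavefront; real wavefronts have 32 or 64 lanes.


  def extend(value, lanes=LANES):
      """Replicate one value for every lane (models DW_OP_LLVM_extend)."""
      return [value] * lanes


  def select_bit_piece(base, overlay, mask):
      """Per lane, pick overlay[i] if bit i of mask is set, otherwise base[i]."""
      return [o if (mask >> i) & 1 else b
              for i, (b, o) in enumerate(zip(base, overlay))]


  def divergent_lane_pc():
      """Outermost procedure: every lane starts as undefined."""
      return extend(None)


  def region_divergent_lane_pc(enclosing, divergent_pc, save_exec):
      """Lanes active on region entry get the region's divergent location."""
      return select_bit_piece(enclosing, extend(divergent_pc), save_exec)


  def lane_pc(divergent, current_pc, exec_mask):
      """Overlay the current PC on the lanes that EXEC marks as active."""
      return select_bit_piece(divergent, extend(current_pc), exec_mask)


  # Lane 3 was inactive on entry; lanes 0 and 1 take the THEN branch.
  lex_1_save_exec = 0b0111
  then_exec = lex_1_save_exec & 0b0011
  divergent_1_then = region_divergent_lane_pc(
      divergent_lane_pc(), 0x100, lex_1_save_exec)  # 0x100 models &lex_1_start
  print(lane_pc(divergent_1_then, 0x110, then_exec))

In this model lanes 0 and 1 report the current program location, lane 2 reports
the start of the region because it is waiting for the ``ELSE`` region, and lane
3 keeps the undefined value inherited from the enclosing procedure.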
2320
2321Other structured control flow regions can be handled similarly. For example,
2322loops would set the divergent program location for the region at the end of the
2323loop. Any lanes active will be in the loop, and any lanes not active must have
2324exited the loop.
2325
2326An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2327``IF/THEN/ELSE`` regions.
2328
2329The DWARF procedures can use the active lane artificial variable described in
2330:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2331``EXEC`` mask in order to support whole or quad wavefront mode.
2332
2333.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2334
2335``DW_AT_LLVM_active_lane``
2336~~~~~~~~~~~~~~~~~~~~~~~~~~
2337
2338The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2339entry is used to specify the lanes that are conceptually active for a SIMT
2340thread.
2341
2342The execution mask may be modified to implement whole or quad wavefront mode
2343operations. For example, all lanes may need to temporarily be made active to
2344execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2345update it to enable the necessary lanes, perform the operations, and then
2346restore the ``EXEC`` mask from the saved value. While executing the whole
2347wavefront region, the conceptual execution mask is the saved value, not the
2348``EXEC`` value.
2349
2350This is handled by defining an artificial variable for the active lane mask. The
2351active lane mask artificial variable would be the actual ``EXEC`` mask for
2352normal regions, and the saved execution mask for regions where the mask is
2353temporarily updated. The location list expression created for this artificial
2354variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2355attribute.
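
The selection between the two masks can be sketched as follows. This is a
minimal Python model under stated assumptions, not compiler code: the function
name and the use of ``None`` to mean "not inside a whole wavefront region" are
hypothetical.

.. code:: python

  def active_lane_mask(exec_mask, saved_exec=None):
      """Return the conceptual active lane mask.

      saved_exec is the mask saved on entry to a whole wavefront region, or
      None when executing a normal region.
      """
      return exec_mask if saved_exec is None else saved_exec


  # Normal region: the EXEC mask itself describes the active lanes.
  exec_mask = 0b0101
  normal = active_lane_mask(exec_mask)

  # Whole wavefront region: EXEC is widened, but the saved mask still
  # describes which lanes are conceptually active.
  saved_exec, exec_mask = exec_mask, 0b1111
  whole_wavefront = active_lane_mask(exec_mask, saved_exec)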
2356
2357``DW_AT_LLVM_augmentation``
2358~~~~~~~~~~~~~~~~~~~~~~~~~~~
2359
2360For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2361debugger information entry has the following value for the augmentation string:
2362
2363::
2364
2365  [amdgpu:v0.0]
2366
2367The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2368extensions used in the DWARF of the compilation unit. The version number
2369conforms to [SEMVER]_.
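
A consumer could extract the vendor and version from such a string along the
following lines. This Python sketch is an assumption about how a tool might do
it: the regular expression is inferred from the string shown above, and the
function name is hypothetical.

.. code:: python

  import re

  # Matches one "[vendor:vX.Y]" entry; the pattern is an assumption based on
  # the augmentation string shown above.
  _ENTRY = re.compile(r"\[([^:\[\]]+):v(\d+)\.(\d+)\]")


  def parse_augmentation(text):
      """Return (vendor, major, minor) for each entry found in text."""
      return [(vendor, int(major), int(minor))
              for vendor, major, minor in _ENTRY.findall(text)]


  entries = parse_augmentation("[amdgpu:v0.0]")

Returning a list leaves room for strings that carry more than one bracketed
entry.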
2370
2371Call Frame Information
2372----------------------
2373
2374DWARF Call Frame Information (CFI) describes how a consumer can virtually
2375*unwind* call frames in a running process or core dump. See DWARF Version 5
2376section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2377
2378For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2379
23801.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2381
2382    ::
2383
2384      [amd:v0.0]
2385
    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or in the FDEs that use it. The version number
    conforms to [SEMVER]_.
2389
23902.  ``address_size`` for the ``Global`` address space is defined in
2391    :ref:`amdgpu-dwarf-address-space-identifier`.
2392
23933.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2394
23954.  ``code_alignment_factor`` is 4 bytes.
2396
2397    .. TODO::
2398
2399       Add to :ref:`amdgpu-processor-table` table.
2400
24015.  ``data_alignment_factor`` is 4 bytes.
2402
2403    .. TODO::
2404
2405       Add to :ref:`amdgpu-processor-table` table.
2406
24076.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2408    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2409
7.  ``initial_instructions``: a subprogram X with fewer registers can be called
    from a subprogram Y that has more allocated, and X will not change any of
    the extra registers as it cannot access them. Therefore, the default rule
    for all columns is ``same value``.
2414
2415For AMDGPU the register number follows the numbering defined in
2416:ref:`amdgpu-dwarf-register-identifier`.
2417
2418For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2419the return address to get the address of a byte within the call site
2420instructions. See DWARF Version 5 section 6.4.4.
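
The reason for subtracting 1 can be seen when a call is the last instruction of
a subprogram, so the return address is the first byte of whatever follows. The
Python sketch below is illustrative only: the range table, its addresses, and
the function name are hypothetical and do not correspond to any LLVM or
debugger API.

.. code:: python

  import bisect


  def find_call_site_range(ranges, return_address):
      """Find the address range that contains the call site.

      ranges is a sorted list of half-open (start, end) address ranges. The
      lookup uses return_address - 1, which lies inside the call instruction.
      """
      lookup = return_address - 1
      starts = [start for start, _ in ranges]
      index = bisect.bisect_right(starts, lookup) - 1
      if index >= 0 and ranges[index][0] <= lookup < ranges[index][1]:
          return ranges[index]
      return None


  # Two adjacent subprograms; the call is the last instruction of the first.
  ranges = [(0x100, 0x120), (0x120, 0x140)]
  caller = find_call_site_range(ranges, 0x120)

Looking up ``0x120`` directly would select the second range, while looking up
``0x120 - 1`` selects the range that actually contains the call.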
2421
2422Accelerated Access
2423------------------
2424
2425See DWARF Version 5 section 6.1.
2426
2427Lookup By Name Section Header
2428~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2429
2430See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2431
For AMDGPU the lookup by name section header table has the following values:
2433
2434``augmentation_string_size`` (uword)
2435
2436  Set to the length of the ``augmentation_string`` value which is always a
2437  multiple of 4.
2438
2439``augmentation_string`` (sequence of UTF-8 characters)
2440
2441  Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2442
2443  ::
2444
2445    [amdgpu:v0.0]
2446
2447  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2448  extensions used in the DWARF of this index. The version number conforms to
2449  [SEMVER]_.
2450
2451  .. note::
2452
2453    This is different to the DWARF Version 5 definition that requires the first
2454    4 characters to be the vendor ID. But this is consistent with the other
2455    augmentation strings and does allow multiple vendor contributions. However,
2456    backwards compatibility may be more desirable.
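
The padding rule for ``augmentation_string`` and the resulting
``augmentation_string_size`` can be sketched in Python as follows. The function
name is hypothetical, and the sketch assumes that a string whose length is
already a multiple of 4 receives no extra padding.

.. code:: python

  def pad_augmentation_string(text):
      """Encode text as UTF-8 and null pad it to a multiple of 4 bytes."""
      data = text.encode("utf-8")
      padding = -len(data) % 4
      return data + b"\x00" * padding


  augmentation_string = pad_augmentation_string("[amdgpu:v0.0]")
  augmentation_string_size = len(augmentation_string)

The 13 byte string gains 3 null bytes, so ``augmentation_string_size`` is 16.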
2457
2458Lookup By Address Section Header
2459~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2460
2461See DWARF Version 5 section 6.1.2.
2462
For AMDGPU the lookup by address section header table has the following values:
2464
2465``address_size`` (ubyte)
2466
2467  Match the address size for the ``Global`` address space defined in
2468  :ref:`amdgpu-dwarf-address-space-identifier`.
2469
2470``segment_selector_size`` (ubyte)
2471
2472  AMDGPU does not use a segment selector so this is 0. The entries in the
2473  ``.debug_aranges`` do not have a segment selector.
2474
2475Line Number Information
2476-----------------------
2477
2478See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2479
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2481The instruction set must be obtained from the ELF file header ``e_flags`` field
2482in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2483<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
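
A consumer might read the field as in the following Python sketch. It is a
minimal illustration under stated assumptions: the function name is
hypothetical, the input is assumed to be a little-endian ELF64 header, and the
``0x0ff`` mask value should be checked against the ``EF_AMDGPU_MACH`` entry in
the ELF header section.

.. code:: python

  import struct

  EF_AMDGPU_MACH = 0x0FF  # Assumed mask; confirm against the ELF header table.
  E_FLAGS_OFFSET = 48     # Offset of e_flags in an ELF64 header.


  def amdgpu_mach(elf_header):
      """Return the EF_AMDGPU_MACH field of a little-endian ELF64 header."""
      if len(elf_header) < E_FLAGS_OFFSET + 4:
          raise ValueError("ELF header is truncated")
      if elf_header[:4] != b"\x7fELF":
          raise ValueError("not an ELF file")
      if elf_header[4] != 2 or elf_header[5] != 1:
          raise ValueError("expected ELFCLASS64 and ELFDATA2LSB")
      (e_flags,) = struct.unpack_from("<I", elf_header, E_FLAGS_OFFSET)
      return e_flags & EF_AMDGPU_MACH

The returned value is then compared against the ``EF_AMDGPU_MACH_xxx`` values
to identify the instruction set; the remaining bits of ``e_flags`` are ignored
here.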
2484
2485.. TODO::
2486
2487  Should the ``isa`` state machine register be used to indicate if the code is
2488  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2489
2490For AMDGPU the line number program header fields have the following values (see
2491DWARF Version 5 section 6.2.4):
2492
2493``address_size`` (ubyte)
2494  Matches the address size for the ``Global`` address space defined in
2495  :ref:`amdgpu-dwarf-address-space-identifier`.
2496
2497``segment_selector_size`` (ubyte)
2498  AMDGPU does not use a segment selector so this is 0.
2499
2500``minimum_instruction_length`` (ubyte)
2501  For GFX9-GFX10 this is 4.
2502
2503``maximum_operations_per_instruction`` (ubyte)
2504  For GFX9-GFX10 this is 1.
2505
2506Source text for online-compiled programs (for example, those compiled by the
2507OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2508See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2509Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2510<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2511
2512The Clang option used to control source embedding in AMDGPU is defined in
2513:ref:`amdgpu-clang-debug-options-table`.
2514
2515  .. table:: AMDGPU Clang Debug Options
2516     :name: amdgpu-clang-debug-options-table
2517
2518     ==================== ==================================================
2519     Debug Flag           Description
2520     ==================== ==================================================
2521     -g[no-]embed-source  Enable/disable embedding source text in DWARF
2522                          debug sections. Useful for environments where
2523                          source cannot be written to disk, such as
2524                          when performing online compilation.
2525     ==================== ==================================================
2526
2527For example:
2528
2529``-gembed-source``
2530  Enable the embedded source.
2531
2532``-gno-embed-source``
2533  Disable the embedded source.
2534
253532-Bit and 64-Bit DWARF Formats
2536-------------------------------
2537
2538See DWARF Version 5 section 7.4 and
2539:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2540
2541For AMDGPU:
2542
2543* For the ``amdgcn`` target architecture only the 64-bit process address space
2544  is supported.
2545
2546* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2547  the 32-bit DWARF format.
2548
2549Unit Headers
2550------------
2551
2552For AMDGPU the following values apply for each of the unit headers described in
2553DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2554
2555``address_size`` (ubyte)
2556  Matches the address size for the ``Global`` address space defined in
2557  :ref:`amdgpu-dwarf-address-space-identifier`.
2558
2559.. _amdgpu-code-conventions:
2560
2561Code Conventions
2562================
2563
2564This section provides code conventions used for each supported target triple OS
2565(see :ref:`amdgpu-target-triples`).
2566
2567AMDHSA
2568------
2569
2570This section provides code conventions used when the target triple OS is
2571``amdhsa`` (see :ref:`amdgpu-target-triples`).
2572
2573.. _amdgpu-amdhsa-code-object-metadata:
2574
2575Code Object Metadata
2576~~~~~~~~~~~~~~~~~~~~
2577
2578The code object metadata specifies extensible metadata associated with the code
2579objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
2581:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2582:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2583:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2584:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2585
Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries, such
as the segment sizes needed in a dispatch packet. In addition, a high-level
language runtime may require other information to be included. For example, the
AMD OpenCL runtime records kernel argument information.
2593
2594.. _amdgpu-amdhsa-code-object-metadata-v2:
2595
2596Code Object V2 Metadata
2597+++++++++++++++++++++++
2598
2599.. warning::
2600  Code object V2 is not the default code object version emitted by this version
2601  of LLVM.
2602
2603Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2604(see :ref:`amdgpu-note-records-v2`).
2605
2606The metadata is specified as a YAML formatted string (see [YAML]_ and
2607:doc:`YamlIO`).
2608
2609.. TODO::
2610
2611  Is the string null terminated? It probably should not if YAML allows it to
2612  contain null characters, otherwise it should be.
2613
2614The metadata is represented as a single YAML document comprised of the mapping
2615defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2616referenced tables.
2617
2618For boolean values, the string values of ``false`` and ``true`` are used for
2619false and true respectively.
2620
2621Additional information can be added to the mappings. To avoid conflicts, any
2622non-AMD key names should be prefixed by "*vendor-name*.".
2623
2624  .. table:: AMDHSA Code Object V2 Metadata Map
2625     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2626
2627     ========== ============== ========= =======================================
2628     String Key Value Type     Required? Description
2629     ========== ============== ========= =======================================
2630     "Version"  sequence of    Required  - The first integer is the major
2631                2 integers                 version. Currently 1.
2632                                         - The second integer is the minor
2633                                           version. Currently 0.
2634     "Printf"   sequence of              Each string is encoded information
2635                strings                  about a printf function call. The
2636                                         encoded information is organized as
2637                                         fields separated by colon (':'):
2638
2639                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2640
2641                                         where:
2642
2643                                         ``ID``
2644                                           A 32-bit integer as a unique id for
2645                                           each printf function call
2646
2647                                         ``N``
2648                                           A 32-bit integer equal to the number
2649                                           of arguments of printf function call
2650                                           minus 1
2651
2652                                         ``S[i]`` (where i = 0, 1, ... , N-1)
2653                                           32-bit integers for the size in bytes
2654                                           of the i-th FormatString argument of
2655                                           the printf function call
2656
                                         ``FormatString``
2658                                           The format string passed to the
2659                                           printf function call.
2660     "Kernels"  sequence of    Required  Sequence of the mappings for each
2661                mapping                  kernel in the code object. See
2662                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2663                                         for the definition of the mapping.
2664     ========== ============== ========= =======================================
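
The "Printf" encoding can be decoded with a bounded split, since the format
string is the last field and may itself contain colons. The Python sketch below
is illustrative only: the function name and the sample entry are hypothetical,
and it assumes every field before the format string is a decimal integer.

.. code:: python

  def parse_printf_entry(entry):
      """Split an "ID:N:S[0]:...:S[N-1]:FormatString" entry.

      Returns (id, sizes, format_string). The format string is kept intact
      even if it contains ':' characters.
      """
      id_text, count_text, rest = entry.split(":", 2)
      count = int(count_text)
      parts = rest.split(":", count)
      if len(parts) != count + 1:
          raise ValueError("entry has fewer size fields than N")
      sizes = [int(part) for part in parts[:count]]
      return int(id_text), sizes, parts[count]


  printf_id, sizes, fmt = parse_printf_entry("1:2:4:8:x=%d y=%ld: done")

For the sample entry this yields an id of 1, argument sizes of 4 and 8 bytes,
and the format string ``x=%d y=%ld: done``.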
2665
2666..
2667
2668  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2669     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2670
2671     ================= ============== ========= ================================
2672     String Key        Value Type     Required? Description
2673     ================= ============== ========= ================================
2674     "Name"            string         Required  Source name of the kernel.
2675     "SymbolName"      string         Required  Name of the kernel
2676                                                descriptor ELF symbol.
2677     "Language"        string                   Source language of the kernel.
2678                                                Values include:
2679
2680                                                - "OpenCL C"
2681                                                - "OpenCL C++"
2682                                                - "HCC"
2683                                                - "OpenMP"
2684
2685     "LanguageVersion" sequence of              - The first integer is the major
2686                       2 integers                 version.
2687                                                - The second integer is the
2688                                                  minor version.
2689     "Attrs"           mapping                  Mapping of kernel attributes.
2690                                                See
2691                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2692                                                for the mapping definition.
2693     "Args"            sequence of              Sequence of mappings of the
2694                       mapping                  kernel arguments. See
2695                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2696                                                for the definition of the mapping.
2697     "CodeProps"       mapping                  Mapping of properties related to
2698                                                the kernel code. See
2699                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2700                                                for the mapping definition.
2701     ================= ============== ========= ================================
2702
2703..
2704
2705  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2706     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2707
2708     =================== ============== ========= ==============================
2709     String Key          Value Type     Required? Description
2710     =================== ============== ========= ==============================
2711     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2712                         3 integers               must be >=1 and the dispatch
2713                                                  work-group size X, Y, Z must
2714                                                  correspond to the specified
2715                                                  values. Defaults to 0, 0, 0.
2716
2717                                                  Corresponds to the OpenCL
2718                                                  ``reqd_work_group_size``
2719                                                  attribute.
2720     "WorkGroupSizeHint" sequence of              The dispatch work-group size
2721                         3 integers               X, Y, Z is likely to be the
2722                                                  specified values.
2723
2724                                                  Corresponds to the OpenCL
2725                                                  ``work_group_size_hint``
2726                                                  attribute.
2727     "VecTypeHint"       string                   The name of a scalar or vector
2728                                                  type.
2729
2730                                                  Corresponds to the OpenCL
2731                                                  ``vec_type_hint`` attribute.
2732
2733     "RuntimeHandle"     string                   The external symbol name
2734                                                  associated with a kernel.
2735                                                  OpenCL runtime allocates a
2736                                                  global buffer for the symbol
2737                                                  and saves the kernel's address
2738                                                  to it, which is used for
2739                                                  device side enqueueing. Only
2740                                                  available for device side
2741                                                  enqueued kernels.
2742     =================== ============== ========= ==============================
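
The "ReqdWorkGroupSize" rule can be expressed as a small check. The Python
function below is a hypothetical helper written for illustration, not a runtime
API; it treats an all-zero value as "no requirement", as described in the
table.

.. code:: python

  def dispatch_matches_reqd_size(reqd, dispatch):
      """Check a dispatch work-group size against "ReqdWorkGroupSize".

      reqd and dispatch are (X, Y, Z) sequences of integers.
      """
      reqd = tuple(reqd)
      dispatch = tuple(dispatch)
      if len(reqd) != 3 or len(dispatch) != 3:
          raise ValueError("expected three integers for X, Y, Z")
      if reqd == (0, 0, 0):
          return True  # Default value: no required work-group size.
      if any(value < 1 for value in reqd):
          raise ValueError("all values must be >= 1 unless all are 0")
      return dispatch == reqd


  ok = dispatch_matches_reqd_size((64, 1, 1), (64, 1, 1))
  mismatch = dispatch_matches_reqd_size((64, 1, 1), (32, 2, 1))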
2743
2744..
2745
2746  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2747     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2748
2749     ================= ============== ========= ================================
2750     String Key        Value Type     Required? Description
2751     ================= ============== ========= ================================
2752     "Name"            string                   Kernel argument name.
2753     "TypeName"        string                   Kernel argument type name.
2754     "Size"            integer        Required  Kernel argument size in bytes.
2755     "Align"           integer        Required  Kernel argument alignment in
2756                                                bytes. Must be a power of two.
2757     "ValueKind"       string         Required  Kernel argument kind that
2758                                                specifies how to set up the
2759                                                corresponding argument.
2760                                                Values include:
2761
2762                                                "ByValue"
2763                                                  The argument is copied
2764                                                  directly into the kernarg.
2765
2766                                                "GlobalBuffer"
2767                                                  A global address space pointer
2768                                                  to the buffer data is passed
2769                                                  in the kernarg.
2770
2771                                                "DynamicSharedPointer"
2772                                                  A group address space pointer
2773                                                  to dynamically allocated LDS
2774                                                  is passed in the kernarg.
2775
2776                                                "Sampler"
2777                                                  A global address space
2778                                                  pointer to a S# is passed in
2779                                                  the kernarg.
2780
2781                                                "Image"
2782                                                  A global address space
2783                                                  pointer to a T# is passed in
2784                                                  the kernarg.
2785
2786                                                "Pipe"
2787                                                  A global address space pointer
2788                                                  to an OpenCL pipe is passed in
2789                                                  the kernarg.
2790
2791                                                "Queue"
2792                                                  A global address space pointer
2793                                                  to an OpenCL device enqueue
2794                                                  queue is passed in the
2795                                                  kernarg.
2796
2797                                                "HiddenGlobalOffsetX"
2798                                                  The OpenCL grid dispatch
2799                                                  global offset for the X
2800                                                  dimension is passed in the
2801                                                  kernarg.
2802
2803                                                "HiddenGlobalOffsetY"
2804                                                  The OpenCL grid dispatch
2805                                                  global offset for the Y
2806                                                  dimension is passed in the
2807                                                  kernarg.
2808
2809                                                "HiddenGlobalOffsetZ"
2810                                                  The OpenCL grid dispatch
2811                                                  global offset for the Z
2812                                                  dimension is passed in the
2813                                                  kernarg.
2814
2815                                                "HiddenNone"
2816                                                  An argument that is not used
2817                                                  by the kernel. Space needs to
2818                                                  be left for it, but it does
2819                                                  not need to be set up.
2820
2821                                                "HiddenPrintfBuffer"
2822                                                  A global address space pointer
2823                                                  to the runtime printf buffer
2824                                                  is passed in kernarg.
2825
2826                                                "HiddenHostcallBuffer"
2827                                                  A global address space pointer
2828                                                  to the runtime hostcall buffer
2829                                                  is passed in kernarg.
2830
2831                                                "HiddenDefaultQueue"
2832                                                  A global address space pointer
2833                                                  to the OpenCL device enqueue
2834                                                  queue that should be used by
2835                                                  the kernel by default is
2836                                                  passed in the kernarg.
2837
2838                                                "HiddenCompletionAction"
2839                                                  A global address space pointer
2840                                                  to help link enqueued kernels into
2841                                                  the ancestor tree for determining
2842                                                  when the parent kernel has finished.
2843
2844                                                "HiddenMultiGridSyncArg"
2845                                                  A global address space pointer for
2846                                                  multi-grid synchronization is
2847                                                  passed in the kernarg.
2848
2849     "ValueType"       string                   Unused and deprecated. This should no longer
2850                                                be emitted, but is accepted for compatibility.
2851
2852
2853     "PointeeAlign"    integer                  Alignment in bytes of pointee
2854                                                type for pointer type kernel
2855                                                argument. Must be a power
2856                                                of 2. Only present if
2857                                                "ValueKind" is
2858                                                "DynamicSharedPointer".
2859     "AddrSpaceQual"   string                   Kernel argument address space
2860                                                qualifier. Only present if
2861                                                "ValueKind" is "GlobalBuffer" or
2862                                                "DynamicSharedPointer". Values
2863                                                are:
2864
2865                                                - "Private"
2866                                                - "Global"
2867                                                - "Constant"
2868                                                - "Local"
2869                                                - "Generic"
2870                                                - "Region"
2871
2872                                                .. TODO::
2873
2874                                                   Is GlobalBuffer only Global
2875                                                   or Constant? Is
2876                                                   DynamicSharedPointer always
2877                                                   Local? Can HCC allow Generic?
2878                                                   How can Private or Region
2879                                                   ever happen?
2880
2881     "AccQual"         string                   Kernel argument access
2882                                                qualifier. Only present if
2883                                                "ValueKind" is "Image" or
2884                                                "Pipe". Values
2885                                                are:
2886
2887                                                - "ReadOnly"
2888                                                - "WriteOnly"
2889                                                - "ReadWrite"
2890
2891                                                .. TODO::
2892
2893                                                   Does this apply to
2894                                                   GlobalBuffer?
2895
2896     "ActualAccQual"   string                   The actual memory accesses
2897                                                performed by the kernel on the
2898                                                kernel argument. Only present if
2899                                                "ValueKind" is "GlobalBuffer",
2900                                                "Image", or "Pipe". This may be
2901                                                more restrictive than indicated
2902                                                by "AccQual" to reflect what the
2903                                                kernel actually does. If not
2904                                                present then the runtime must
2905                                                assume what is implied by
2906                                                "AccQual" and "IsConst". Values
2907                                                are:
2908
2909                                                - "ReadOnly"
2910                                                - "WriteOnly"
2911                                                - "ReadWrite"
2912
2913     "IsConst"         boolean                  Indicates if the kernel argument
2914                                                is const qualified. Only present
2915                                                if "ValueKind" is
2916                                                "GlobalBuffer".
2917
2918     "IsRestrict"      boolean                  Indicates if the kernel argument
2919                                                is restrict qualified. Only
2920                                                present if "ValueKind" is
2921                                                "GlobalBuffer".
2922
2923     "IsVolatile"      boolean                  Indicates if the kernel argument
2924                                                is volatile qualified. Only
2925                                                present if "ValueKind" is
2926                                                "GlobalBuffer".
2927
2928     "IsPipe"          boolean                  Indicates if the kernel argument
2929                                                is pipe qualified. Only present
2930                                                if "ValueKind" is "Pipe".
2931
2932                                                .. TODO::
2933
2934                                                   Can GlobalBuffer be pipe
2935                                                   qualified?
2936
2937     ================= ============== ========= ================================
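
The "Only present if" rules in the table above can be checked mechanically.
The following is a minimal sketch and not part of LLVM: ``ALLOWED_VALUE_KINDS``
and ``misplaced_keys`` are hypothetical names, and the kernel argument map is
assumed to have already been parsed from the metadata into a Python ``dict``.

.. code-block:: python

   # Hypothetical helper, not an LLVM API. Maps each optional key to the
   # "ValueKind" values for which the table says the key may be present.
   ALLOWED_VALUE_KINDS = {
       "PointeeAlign": {"DynamicSharedPointer"},
       "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
       "AccQual": {"Image", "Pipe"},
       "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
       "IsConst": {"GlobalBuffer"},
       "IsRestrict": {"GlobalBuffer"},
       "IsVolatile": {"GlobalBuffer"},
       "IsPipe": {"Pipe"},
   }


   def misplaced_keys(arg):
       """Return the sorted keys of ``arg`` not permitted for its "ValueKind"."""
       kind = arg.get("ValueKind")
       return sorted(
           key
           for key, kinds in ALLOWED_VALUE_KINDS.items()
           if key in arg and kind not in kinds
       )

An empty result means that every optional key in the map is compatible with
the argument's "ValueKind". The sketch does not check the values themselves.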
2938
2939..
2940
2941  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2942     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2943
2944     ============================ ============== ========= =====================
2945     String Key                   Value Type     Required? Description
2946     ============================ ============== ========= =====================
2947     "KernargSegmentSize"         integer        Required  The size in bytes of
2948                                                           the kernarg segment
2949                                                           that holds the values
2950                                                           of the arguments to
2951                                                           the kernel.
2952     "GroupSegmentFixedSize"      integer        Required  The amount of group
2953                                                           segment memory
2954                                                           required by a
2955                                                           work-group in
2956                                                           bytes. This does not
2957                                                           include any
2958                                                           dynamically allocated
2959                                                           group segment memory
2960                                                           that may be added
2961                                                           when the kernel is
2962                                                           dispatched.
2963     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
2964                                                           private address space
2965                                                           memory required for a
2966                                                           work-item in
2967                                                           bytes. If the kernel
2968                                                           uses a dynamic call
2969                                                           stack then additional
2970                                                           space must be added
2971                                                           to this value for the
2972                                                           call stack.
2973     "KernargSegmentAlign"        integer        Required  The maximum byte
2974                                                           alignment of
2975                                                           arguments in the
2976                                                           kernarg segment. Must
2977                                                           be a power of 2.
2978     "WavefrontSize"              integer        Required  Wavefront size. Must
2979                                                           be a power of 2.
2980     "NumSGPRs"                   integer        Required  Number of scalar
2981                                                           registers used by a
2982                                                           wavefront for
2983                                                           GFX6-GFX10. This
2984                                                           includes the special
2985                                                           SGPRs for VCC, Flat
2986                                                           Scratch (GFX7-GFX10)
2987                                                           and XNACK (for
2988                                                           GFX8-GFX10). It does
2989                                                           not include the 16
2990                                                           SGPRs added if a trap
2991                                                           handler is
2992                                                           enabled. It is not
2993                                                           rounded up to the
2994                                                           allocation
2995                                                           granularity.
2996     "NumVGPRs"                   integer        Required  Number of vector
2997                                                           registers used by
2998                                                           each work-item for
2999                                                           GFX6-GFX10.
3000     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
3001                                                           work-group size
3002                                                           supported by the
3003                                                           kernel in work-items.
3004                                                           Must be >=1 and
3005                                                           consistent with
3006                                                           ReqdWorkGroupSize if
3007                                                           not 0, 0, 0.
3008     "NumSpilledSGPRs"            integer                  Number of stores from
3009                                                           a scalar register to
3010                                                           a register allocator
3011                                                           created spill
3012                                                           location.
3013     "NumSpilledVGPRs"            integer                  Number of stores from
3014                                                           a vector register to
3015                                                           a register allocator
3016                                                           created spill
3017                                                           location.
3018     ============================ ============== ========= =====================
3019
3020.. _amdgpu-amdhsa-code-object-metadata-v3:
3021
3022Code Object V3 Metadata
3023+++++++++++++++++++++++
3024
3025.. warning::
3026  Code object V3 is not the default code object version emitted by this version
3027  of LLVM.
3028
3029Metadata for code object V3 and above is specified by the ``NT_AMDGPU_METADATA``
3030note record (see :ref:`amdgpu-note-records-v3-onwards`).
3031
3032The metadata is represented as Message Pack formatted binary data (see
3033[MsgPack]_). The top level is a Message Pack map that includes the
3034keys defined in table
3035:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3036tables.
3037
3038Additional information can be added to the maps. To avoid conflicts,
3039any added key name should be prefixed by "*vendor-name*." where
3040*vendor-name* can be the name of the vendor and of the specific vendor
3041tool that generates the information. The prefix is abbreviated to
3042simply "." when it appears within a map that has been added by the
3043same *vendor-name*.
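
One reading of the abbreviation rule above is sketched below. The helper name
``expand_key`` and the vendor name ``acme`` are hypothetical and used only for
illustration. The sketch assumes that the caller already knows which
*vendor-name* added the enclosing map.

.. code-block:: python

   def expand_key(key, enclosing_vendor):
       """Expand an abbreviated "." prefix to the full vendor prefix.

       Hypothetical helper: ``enclosing_vendor`` is the vendor-name that
       added the map in which ``key`` appears.
       """
       if key.startswith("."):
           # ".name" inside a map added by "amdhsa" stands for "amdhsa.name".
           return enclosing_vendor + key
       # Keys that carry an explicit prefix are left unchanged.
       return key


   print(expand_key(".name", "amdhsa"))
   print(expand_key("acme.note", "amdhsa"))

Under this reading, the "." keys of the kernel metadata map are abbreviations
of keys with the same prefix as the "amdhsa.kernels" key that contains them.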
3044
3045  .. table:: AMDHSA Code Object V3 Metadata Map
3046     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3047
3048     ================= ============== ========= =======================================
3049     String Key        Value Type     Required? Description
3050     ================= ============== ========= =======================================
3051     "amdhsa.version"  sequence of    Required  - The first integer is the major
3052                       2 integers                 version. Currently 1.
3053                                                - The second integer is the minor
3054                                                  version. Currently 0.
3055     "amdhsa.printf"   sequence of              Each string is encoded information
3056                       strings                  about a printf function call. The
3057                                                encoded information is organized as
3058                                                fields separated by colon (':'):
3059
3060                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3061
3062                                                where:
3063
3064                                                ``ID``
3065                                                  A 32-bit integer as a unique id for
3066                                                  each printf function call.
3067
3068                                                ``N``
3069                                                  A 32-bit integer equal to the number
3070                                                  of arguments of the printf function
3071                                                  call minus 1.
3072
3073                                                ``S[i]`` (where i = 0, 1, ... , N-1)
3074                                                  32-bit integers for the size in bytes
3075                                                  of the i-th FormatString argument of
3076                                                  the printf function call.
3077
3078                                                ``FormatString``
3079                                                  The format string passed to the
3080                                                  printf function call.
3081     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
3082                       map                      kernel in the code object. See
3083                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3084                                                for the definition of the keys included
3085                                                in that map.
3086     ================= ============== ========= =======================================
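
The field layout of an "amdhsa.printf" string can be decoded with a short
routine. The following is a minimal sketch under stated assumptions and not
the runtime's implementation: the name ``parse_printf_record`` is hypothetical,
and ``ID``, ``N`` and each ``S[i]`` are assumed to be written as decimal text.
Because the format string may itself contain colons, only the first ``N + 2``
colons are treated as field separators.

.. code-block:: python

   def parse_printf_record(record):
       """Split one "amdhsa.printf" string into (id, argument sizes, format)."""
       # ID and N never contain a colon, so peel them off first.
       id_field, n_field, rest = record.split(":", 2)
       count = int(n_field)
       # Split off exactly N size fields. Whatever follows is the format
       # string, which may contain further colons.
       fields = rest.split(":", count)
       if len(fields) != count + 1:
           raise ValueError("expected %d size fields in %r" % (count, record))
       sizes = [int(field) for field in fields[:count]]
       return int(id_field), sizes, fields[count]


   print(parse_printf_record("1:2:4:8:x=%d y=%ld"))

A record with ``N`` equal to 0 has no size fields, so everything after the
second colon is returned as the format string.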
3087
3088..
3089
3090  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3091     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3092
3093     =================================== ============== ========= ================================
3094     String Key                          Value Type     Required? Description
3095     =================================== ============== ========= ================================
3096     ".name"                             string         Required  Source name of the kernel.
3097     ".symbol"                           string         Required  Name of the kernel
3098                                                                  descriptor ELF symbol.
3099     ".language"                         string                   Source language of the kernel.
3100                                                                  Values include:
3101
3102                                                                  - "OpenCL C"
3103                                                                  - "OpenCL C++"
3104                                                                  - "HCC"
3105                                                                  - "HIP"
3106                                                                  - "OpenMP"
3107                                                                  - "Assembler"
3108
3109     ".language_version"                 sequence of              - The first integer is the major
3110                                         2 integers                 version.
3111                                                                  - The second integer is the
3112                                                                    minor version.
3113     ".args"                             sequence of              Sequence of maps of the
3114                                         map                      kernel arguments. See
3115                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3116                                                                  for the definition of the keys
3117                                                                  included in that map.
3118     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3119                                         3 integers               must be >=1 and the dispatch
3120                                                                  work-group size X, Y, Z must
3121                                                                  correspond to the specified
3122                                                                  values. Defaults to 0, 0, 0.
3123
3124                                                                  Corresponds to the OpenCL
3125                                                                  ``reqd_work_group_size``
3126                                                                  attribute.
3127     ".workgroup_size_hint"              sequence of              The dispatch work-group size
3128                                         3 integers               X, Y, Z is likely to be the
3129                                                                  specified values.
3130
3131                                                                  Corresponds to the OpenCL
3132                                                                  ``work_group_size_hint``
3133                                                                  attribute.
3134     ".vec_type_hint"                    string                   The name of a scalar or vector
3135                                                                  type.
3136
3137                                                                  Corresponds to the OpenCL
3138                                                                  ``vec_type_hint`` attribute.
3139
3140     ".device_enqueue_symbol"            string                   The external symbol name
3141                                                                  associated with a kernel. The
3142                                                                  OpenCL runtime allocates a
3143                                                                  global buffer for the symbol
3144                                                                  and saves the kernel's address
3145                                                                  to it, which is used for
3146                                                                  device side enqueueing. Only
3147                                                                  available for device side
3148                                                                  enqueued kernels.
3149     ".kernarg_segment_size"             integer        Required  The size in bytes of
3150                                                                  the kernarg segment
3151                                                                  that holds the values
3152                                                                  of the arguments to
3153                                                                  the kernel.
3154     ".group_segment_fixed_size"         integer        Required  The amount of group
3155                                                                  segment memory
3156                                                                  required by a
3157                                                                  work-group in
3158                                                                  bytes. This does not
3159                                                                  include any
3160                                                                  dynamically allocated
3161                                                                  group segment memory
3162                                                                  that may be added
3163                                                                  when the kernel is
3164                                                                  dispatched.
3165     ".private_segment_fixed_size"       integer        Required  The amount of fixed
3166                                                                  private address space
3167                                                                  memory required for a
3168                                                                  work-item in
3169                                                                  bytes. If the kernel
3170                                                                  uses a dynamic call
3171                                                                  stack then additional
3172                                                                  space must be added
3173                                                                  to this value for the
3174                                                                  call stack.
3175     ".kernarg_segment_align"            integer        Required  The maximum byte
3176                                                                  alignment of
3177                                                                  arguments in the
3178                                                                  kernarg segment. Must
3179                                                                  be a power of 2.
3180     ".wavefront_size"                   integer        Required  Wavefront size. Must
3181                                                                  be a power of 2.
3182     ".sgpr_count"                       integer        Required  Number of scalar
3183                                                                  registers required by a
3184                                                                  wavefront for
3185                                                                  GFX6-GFX9. A register
3186                                                                  is required if it is
3187                                                                  used explicitly, or
3188                                                                  if a higher numbered
3189                                                                  register is used
3190                                                                  explicitly. This
3191                                                                  includes the special
3192                                                                  SGPRs for VCC, Flat
3193                                                                  Scratch (GFX7-GFX9)
3194                                                                  and XNACK (for
3195                                                                  GFX8-GFX9). It does
3196                                                                  not include the 16
3197                                                                  SGPRs added if a trap
3198                                                                  handler is
3199                                                                  enabled. It is not
3200                                                                  rounded up to the
3201                                                                  allocation
3202                                                                  granularity.
3203     ".vgpr_count"                       integer        Required  Number of vector
3204                                                                  registers required by
3205                                                                  each work-item for
3206                                                                  GFX6-GFX9. A register
3207                                                                  is required if it is
3208                                                                  used explicitly, or
3209                                                                  if a higher numbered
3210                                                                  register is used
3211                                                                  explicitly.
3212     ".agpr_count"                       integer        Required  Number of accumulator
3213                                                                  registers required by
3214                                                                  each work-item for
3215                                                                  GFX90A, GFX908.
3216     ".max_flat_workgroup_size"          integer        Required  Maximum flat
3217                                                                  work-group size
3218                                                                  supported by the
3219                                                                  kernel in work-items.
3220                                                                  Must be >=1 and
3221                                                                  consistent with
3222                                                                  ".reqd_workgroup_size" if
3223                                                                  not 0, 0, 0.
3224     ".sgpr_spill_count"                 integer                  Number of stores from
3225                                                                  a scalar register to
3226                                                                  a register allocator
3227                                                                  created spill
3228                                                                  location.
3229     ".vgpr_spill_count"                 integer                  Number of stores from
3230                                                                  a vector register to
3231                                                                  a register allocator
3232                                                                  created spill
3233                                                                  location.
3234     ".kind"                             string                   The kind of the kernel
3235                                                                  with the following
3236                                                                  values:
3237
3238                                                                  "normal"
3239                                                                    Regular kernels.
3240
3241                                                                  "init"
3242                                                                    These kernels must be
3243                                                                    invoked after loading
3244                                                                    the containing code
3245                                                                    object and must
3246                                                                    complete before any
3247                                                                    normal and fini
3248                                                                    kernels in the same
3249                                                                    code object are
3250                                                                    invoked.
3251
3252                                                                  "fini"
3253                                                                    These kernels must be
3254                                                                    invoked before
3255                                                                    unloading the
3256                                                                    containing code object
3257                                                                    and after all init and
3258                                                                    normal kernels in the
3259                                                                    same code object have
3260                                                                    been invoked and
3261                                                                    completed.
3262
3263                                                                  If omitted, "normal" is
3264                                                                  assumed.
3265     =================================== ============== ========= ================================
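
The constraints in the table above that do not depend on the target can be
checked as sketched below. This is an illustrative sketch and not an LLVM or
runtime API: ``check_kernel`` and the constant names are hypothetical, and the
kernel metadata map is assumed to have already been decoded from Message Pack
into a Python ``dict``. The list of required keys is copied from the
"Required?" column.

.. code-block:: python

   # Keys marked "Required" in the kernel metadata map table.
   REQUIRED_KERNEL_KEYS = (
       ".name",
       ".symbol",
       ".kernarg_segment_size",
       ".group_segment_fixed_size",
       ".private_segment_fixed_size",
       ".kernarg_segment_align",
       ".wavefront_size",
       ".sgpr_count",
       ".vgpr_count",
       ".agpr_count",
       ".max_flat_workgroup_size",
   )
   # Keys whose description says "Must be a power of 2".
   POWER_OF_TWO_KEYS = (".kernarg_segment_align", ".wavefront_size")
   KERNEL_KINDS = ("normal", "init", "fini")


   def is_power_of_two(value):
       return isinstance(value, int) and value > 0 and (value & (value - 1)) == 0


   def check_kernel(kernel):
       """Return a list of problems found in one kernel metadata map."""
       problems = [
           "missing required key " + key
           for key in REQUIRED_KERNEL_KEYS
           if key not in kernel
       ]
       for key in POWER_OF_TWO_KEYS:
           if key in kernel and not is_power_of_two(kernel[key]):
               problems.append(key + " must be a power of 2")
       if kernel.get(".max_flat_workgroup_size", 1) < 1:
           problems.append(".max_flat_workgroup_size must be >= 1")
       # ".kind" defaults to "normal" when omitted.
       if kernel.get(".kind", "normal") not in KERNEL_KINDS:
           problems.append(".kind must be one of " + ", ".join(KERNEL_KINDS))
       return problems

An empty list means that none of these checks failed. The sketch deliberately
ignores the register count limits, which depend on the target.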
3266
3267..
3268
3269  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3270     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3271
3272     ====================== ============== ========= ================================
3273     String Key             Value Type     Required? Description
3274     ====================== ============== ========= ================================
3275     ".name"                string                   Kernel argument name.
3276     ".type_name"           string                   Kernel argument type name.
3277     ".size"                integer        Required  Kernel argument size in bytes.
3278     ".offset"              integer        Required  Kernel argument offset in
3279                                                     bytes. The offset must be a
3280                                                     multiple of the alignment
3281                                                     required by the argument.
3282     ".value_kind"          string         Required  Kernel argument kind that
3283                                                     specifies how to set up the
3284                                                     corresponding argument.
3285                                                     Values include:
3286
3287                                                     "by_value"
3288                                                       The argument is copied
3289                                                       directly into the kernarg.
3290
3291                                                     "global_buffer"
3292                                                       A global address space pointer
3293                                                       to the buffer data is passed
3294                                                       in the kernarg.
3295
3296                                                     "dynamic_shared_pointer"
3297                                                       A group address space pointer
3298                                                       to dynamically allocated LDS
3299                                                       is passed in the kernarg.
3300
3301                                                     "sampler"
3302                                                       A global address space
3303                                                       pointer to a S# is passed in
3304                                                       the kernarg.
3305
3306                                                     "image"
3307                                                       A global address space
3308                                                       pointer to a T# is passed in
3309                                                       the kernarg.
3310
3311                                                     "pipe"
3312                                                       A global address space pointer
3313                                                       to an OpenCL pipe is passed in
3314                                                       the kernarg.
3315
3316                                                     "queue"
3317                                                       A global address space pointer
3318                                                       to an OpenCL device enqueue
3319                                                       queue is passed in the
3320                                                       kernarg.
3321
3322                                                     "hidden_global_offset_x"
3323                                                       The OpenCL grid dispatch
3324                                                       global offset for the X
3325                                                       dimension is passed in the
3326                                                       kernarg.
3327
3328                                                     "hidden_global_offset_y"
3329                                                       The OpenCL grid dispatch
3330                                                       global offset for the Y
3331                                                       dimension is passed in the
3332                                                       kernarg.
3333
3334                                                     "hidden_global_offset_z"
3335                                                       The OpenCL grid dispatch
3336                                                       global offset for the Z
3337                                                       dimension is passed in the
3338                                                       kernarg.
3339
3340                                                     "hidden_none"
3341                                                       An argument that is not used
3342                                                       by the kernel. Space needs to
3343                                                       be left for it, but it does
3344                                                       not need to be set up.
3345
3346                                                     "hidden_printf_buffer"
3347                                                       A global address space pointer
3348                                                       to the runtime printf buffer
3349                                                       is passed in the kernarg.
3350
3351                                                     "hidden_hostcall_buffer"
3352                                                       A global address space pointer
3353                                                       to the runtime hostcall buffer
3354                                                       is passed in the kernarg.
3355
3356                                                     "hidden_default_queue"
3357                                                       A global address space pointer
3358                                                       to the OpenCL device enqueue
3359                                                       queue that should be used by
3360                                                       the kernel by default is
3361                                                       passed in the kernarg.
3362
3363                                                     "hidden_completion_action"
3364                                                       A global address space pointer
3365                                                       to help link enqueued kernels into
3366                                                       the ancestor tree for determining
3367                                                       when the parent kernel has finished.
3368
3369                                                     "hidden_multigrid_sync_arg"
3370                                                       A global address space pointer for
3371                                                       multi-grid synchronization is
3372                                                       passed in the kernarg.
3373
3374     ".value_type"          string                   Unused and deprecated. This should no longer
3375                                                     be emitted, but is accepted for compatibility.
3376
3377     ".pointee_align"       integer                  Alignment in bytes of pointee
3378                                                     type for pointer type kernel
3379                                                     argument. Must be a power
3380                                                     of 2. Only present if
3381                                                     ".value_kind" is
3382                                                     "dynamic_shared_pointer".
3383     ".address_space"       string                   Kernel argument address space
3384                                                     qualifier. Only present if
3385                                                     ".value_kind" is "global_buffer" or
3386                                                     "dynamic_shared_pointer". Values
3387                                                     are:
3388
3389                                                     - "private"
3390                                                     - "global"
3391                                                     - "constant"
3392                                                     - "local"
3393                                                     - "generic"
3394                                                     - "region"
3395
3396                                                     .. TODO::
3397
3398                                                        Is "global_buffer" only "global"
3399                                                        or "constant"? Is
3400                                                        "dynamic_shared_pointer" always
3401                                                        "local"? Can HCC allow "generic"?
3402                                                        How can "private" or "region"
3403                                                        ever happen?
3404
3405     ".access"              string                   Kernel argument access
3406                                                     qualifier. Only present if
3407                                                     ".value_kind" is "image" or
3408                                                     "pipe". Values
3409                                                     are:
3410
3411                                                     - "read_only"
3412                                                     - "write_only"
3413                                                     - "read_write"
3414
3415                                                     .. TODO::
3416
3417                                                        Does this apply to
3418                                                        "global_buffer"?
3419
3420     ".actual_access"       string                   The actual memory accesses
3421                                                     performed by the kernel on the
3422                                                     kernel argument. Only present if
3423                                                     ".value_kind" is "global_buffer",
3424                                                     "image", or "pipe". This may be
3425                                                     more restrictive than indicated
3426                                                     by ".access" to reflect what the
3427                                                     kernel actually does. If not
3428                                                     present then the runtime must
3429                                                     assume what is implied by
3430                                                     ".access" and ".is_const". Values
3431                                                     are:
3432
3433                                                     - "read_only"
3434                                                     - "write_only"
3435                                                     - "read_write"
3436
3437     ".is_const"            boolean                  Indicates if the kernel argument
3438                                                     is const qualified. Only present
3439                                                     if ".value_kind" is
3440                                                     "global_buffer".
3441
3442     ".is_restrict"         boolean                  Indicates if the kernel argument
3443                                                     is restrict qualified. Only
3444                                                     present if ".value_kind" is
3445                                                     "global_buffer".
3446
3447     ".is_volatile"         boolean                  Indicates if the kernel argument
3448                                                     is volatile qualified. Only
3449                                                     present if ".value_kind" is
3450                                                     "global_buffer".
3451
3452     ".is_pipe"             boolean                  Indicates if the kernel argument
3453                                                     is pipe qualified. Only present
3454                                                     if ".value_kind" is "pipe".
3455
3456                                                     .. TODO::
3457
3458                                                        Can "global_buffer" be pipe
3459                                                        qualified?
3460
3461     ====================== ============== ========= ================================
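
The constraints in the table above can be illustrated with a minimal sketch that checks one kernel argument map. The function name ``check_kernel_arg``, the example argument maps, and the ``required_align`` parameter are hypothetical. The table does not list an alignment key, so the sketch assumes the caller supplies the alignment. It is not how LLVM or any runtime validates metadata.

```python
# Keys that the table restricts to particular ".value_kind" values.
KEY_ALLOWED_KINDS = {
    ".pointee_align": {"dynamic_shared_pointer"},
    ".address_space": {"global_buffer", "dynamic_shared_pointer"},
    ".access": {"image", "pipe"},
    ".actual_access": {"global_buffer", "image", "pipe"},
    ".is_const": {"global_buffer"},
    ".is_restrict": {"global_buffer"},
    ".is_volatile": {"global_buffer"},
    ".is_pipe": {"pipe"},
}
REQUIRED_KEYS = (".size", ".offset", ".value_kind")


def check_kernel_arg(arg, required_align):
    """Return a list of problems found in one kernel argument map.

    required_align is a caller-supplied assumption: the argument's
    required alignment in bytes.
    """
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in arg]
    if problems:
        return problems
    if arg[".offset"] % required_align != 0:
        problems.append(".offset is not a multiple of the required alignment")
    kind = arg[".value_kind"]
    for key, kinds in KEY_ALLOWED_KINDS.items():
        if key in arg and kind not in kinds:
            problems.append(f"{key} is not expected for {kind}")
    align = arg.get(".pointee_align")
    if align is not None and (align <= 0 or align & (align - 1)):
        problems.append(".pointee_align is not a power of 2")
    return problems


# Hypothetical argument maps, written by hand for illustration only.
good_arg = {".name": "out", ".size": 8, ".offset": 8,
            ".value_kind": "global_buffer", ".address_space": "global",
            ".is_const": False}
bad_arg = {".size": 4, ".offset": 6, ".value_kind": "by_value",
           ".access": "read_only"}
```

Here ``bad_arg`` violates two rules: its offset is not a multiple of a 4-byte alignment, and ".access" is only expected for "image" or "pipe" arguments.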
3462
3463.. _amdgpu-amdhsa-code-object-metadata-v4:
3464
3465Code Object V4 Metadata
3466+++++++++++++++++++++++
3467
3468Code object V4 metadata is the same as
3469:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3470defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3471
3472  .. table:: AMDHSA Code Object V4 Metadata Map Changes
3473     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3474
3475     ================= ============== ========= =======================================
3476     String Key        Value Type     Required? Description
3477     ================= ============== ========= =======================================
3478     "amdhsa.version"  sequence of    Required  - The first integer is the major
3479                       2 integers                 version. Currently 1.
3480                                                - The second integer is the minor
3481                                                  version. Currently 1.
3482     "amdhsa.target"   string         Required  The target name of the code using the syntax:
3483
3484                                                .. code::
3485
3486                                                  <target-triple> [ "-" <target-id> ]
3487
3488                                                A canonical target ID must be
3489                                                used. See :ref:`amdgpu-target-triples`
3490                                                and :ref:`amdgpu-target-id`.
3491     ================= ============== ========= =======================================
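
As a rough illustration of the "amdhsa.target" syntax, the sketch below splits a target string into its triple and optional target ID. The function name ``split_amdhsa_target`` and the example strings in the comments are hypothetical. The sketch assumes the triple always has the four components ``<Architecture>-<Vendor>-<OS>-<Environment>``, with a possibly empty environment. It does not check that the target ID is canonical.

```python
def split_amdhsa_target(target):
    """Split '<target-triple> [ "-" <target-id> ]' into (triple, target_id).

    Assumes a four-component triple whose environment may be empty.
    target_id is None when the optional part is absent.
    """
    # At most four splits: four triple components plus the remainder, so
    # any "-" inside the target ID (for example a feature suffix) is kept.
    parts = target.split("-", 4)
    if len(parts) < 4:
        raise ValueError(f"expected a four-component target triple: {target!r}")
    triple = "-".join(parts[:4])
    target_id = parts[4] if len(parts) == 5 else None
    return triple, target_id


# Illustrative input only; the exact string is an assumption.
triple, target_id = split_amdhsa_target("amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-")
```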
3492
3493.. _amdgpu-amdhsa-code-object-metadata-v5:
3494
3495Code Object V5 Metadata
3496+++++++++++++++++++++++
3497
3498.. warning::
3499  Code object V5 is not the default code object version emitted by this version
3500  of LLVM.
3501
3502
3503Code object V5 metadata is the same as
3504:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3505:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5` and table
3506:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3507
3508  .. table:: AMDHSA Code Object V5 Metadata Map Changes
3509     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3510
3511     ================= ============== ========= =======================================
3512     String Key        Value Type     Required? Description
3513     ================= ============== ========= =======================================
3514     "amdhsa.version"  sequence of    Required  - The first integer is the major
3515                       2 integers                 version. Currently 1.
3516                                                - The second integer is the minor
3517                                                  version. Currently 2.
3518     ================= ============== ========= =======================================
3519
3520..
3521
3522  .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3523     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3524
3525     ====================== ============== ========= ================================
3526     String Key             Value Type     Required? Description
3527     ====================== ============== ========= ================================
3528     ".value_kind"          string         Required  Kernel argument kind that
3529                                                     specifies how to set up the
3530                                                     corresponding argument.
3531                                                     Values are the same as for
3532                                                     code object V3 metadata
3533                                                     (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
3534                                                     with the following additions:
3535
3536                                                     "hidden_block_count_x"
3537                                                       The grid dispatch work-group count for the X dimension
3538                                                       is passed in the kernarg. Some languages, such as OpenCL,
3539                                                       support a last work-group in each dimension being partial.
3540                                                       This count only includes the non-partial work-group count.
3541                                                       This is not the same as the value in the AQL dispatch packet,
3542                                                       which has the grid size in work-items.
3543
3544                                                     "hidden_block_count_y"
3545                                                       The grid dispatch work-group count for the Y dimension
3546                                                       is passed in the kernarg. Some languages, such as OpenCL,
3547                                                       support a last work-group in each dimension being partial.
3548                                                       This count only includes the non-partial work-group count.
3549                                                       This is not the same as the value in the AQL dispatch packet,
3550                                                       which has the grid size in work-items. If the grid dimensionality
3551                                                       is 1, then this must be 1.
3552
3553                                                     "hidden_block_count_z"
3554                                                       The grid dispatch work-group count for the Z dimension
3555                                                       is passed in the kernarg. Some languages, such as OpenCL,
3556                                                       support a last work-group in each dimension being partial.
3557                                                       This count only includes the non-partial work-group count.
3558                                                       This is not the same as the value in the AQL dispatch packet,
3559                                                       which has the grid size in work-items. If the grid dimensionality
3560                                                       is 1 or 2, then this must be 1.
3561
3562                                                     "hidden_group_size_x"
3563                                                       The grid dispatch work-group size for the X dimension is
3564                                                       passed in the kernarg. This size only applies to the
3565                                                       non-partial work-groups. This is the same value as the AQL
3566                                                       dispatch packet work-group size.
3567
3568                                                     "hidden_group_size_y"
3569                                                       The grid dispatch work-group size for the Y dimension is
3570                                                       passed in the kernarg. This size only applies to the
3571                                                       non-partial work-groups. This is the same value as the AQL
3572                                                       dispatch packet work-group size. If the grid dimensionality
3573                                                       is 1, then this must be 1.
3574
3575                                                     "hidden_group_size_z"
3576                                                       The grid dispatch work-group size for the Z dimension is
3577                                                       passed in the kernarg. This size only applies to the
3578                                                       non-partial work-groups. This is the same value as the AQL
3579                                                       dispatch packet work-group size. If the grid dimensionality
3580                                                       is 1 or 2, then this must be 1.
3581
3582                                                     "hidden_remainder_x"
3583                                                       The grid dispatch work-group size of the partial work-group
3584                                                       of the X dimension, if it exists. Must be zero if a partial
3585                                                       work-group does not exist in the X dimension.
3586
3587                                                     "hidden_remainder_y"
3588                                                       The grid dispatch work-group size of the partial work-group
3589                                                       of the Y dimension, if it exists. Must be zero if a partial
3590                                                       work-group does not exist in the Y dimension.
3591
3592                                                     "hidden_remainder_z"
3593                                                       The grid dispatch work-group size of the partial work-group
3594                                                       of the Z dimension, if it exists. Must be zero if a partial
3595                                                       work-group does not exist in the Z dimension.
3596
3597                                                     "hidden_grid_dims"
3598                                                       The grid dispatch dimensionality. This is the same value
3599                                                       as the AQL dispatch packet dimensionality. Must be a value
3600                                                       between 1 and 3.
3601
3602                                                     "hidden_heap_v1"
3603                                                       A global address space pointer to an initialized memory
3604                                                       buffer that conforms to the requirements of the malloc/free
3605                                                       device library V1 version implementation.
3606
3607                                                     "hidden_private_base"
3608                                                       The high 32 bits of the flat addressing private aperture base.
3609                                                       Only used by GFX8 to allow conversion between private segment
3610                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3611
3612                                                     "hidden_shared_base"
3613                                                       The high 32 bits of the flat addressing shared aperture base.
3614                                                       Only used by GFX8 to allow conversion between shared segment
3615                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3616
3617                                                     "hidden_queue_ptr"
3618                                                       A global memory address space pointer to the ROCm runtime
3619                                                       ``struct amd_queue_t`` structure for the HSA queue of the
3620                                                       associated dispatch AQL packet. It is only required for pre-GFX9
3621                                                       devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
3622
3623     ====================== ============== ========= ================================
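
The relationship between the hidden block count, group size and remainder values can be sketched as simple integer arithmetic on the AQL dispatch sizes. The function name ``hidden_dispatch_args`` and the example sizes are hypothetical. Treating unused dimensions as a grid of one work-item with a work-group size of one is an assumption, chosen so that the "must be 1" rules in the table hold.

```python
def hidden_dispatch_args(grid_size, group_size):
    """Derive V5 hidden argument values from AQL-style dispatch sizes.

    grid_size: per-dimension grid size in work-items (1 to 3 entries).
    group_size: per-dimension work-group size, same length as grid_size.
    """
    dims = len(grid_size)
    if not 1 <= dims <= 3 or len(group_size) != dims:
        raise ValueError("expected matching sizes for 1 to 3 dimensions")
    # Assumption: unused dimensions act like one work-item in one work-group.
    grid = list(grid_size) + [1] * (3 - dims)
    group = list(group_size) + [1] * (3 - dims)
    args = {"hidden_grid_dims": dims}
    for axis, items, wg in zip("xyz", grid, group):
        args[f"hidden_block_count_{axis}"] = items // wg  # full work-groups only
        args[f"hidden_group_size_{axis}"] = wg
        args[f"hidden_remainder_{axis}"] = items % wg     # 0 if no partial group
    return args


# A hypothetical 2D dispatch: 1000 x 6 work-items in 256 x 2 work-groups.
example_args = hidden_dispatch_args((1000, 6), (256, 2))
```

In this example the X dimension has three full work-groups plus a partial work-group of 232 work-items, while the Y dimension divides evenly.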
3624
3625..
3626
3627Kernel Dispatch
3628~~~~~~~~~~~~~~~
3629
3630The HSA architected queuing language (AQL) defines a user space memory interface
3631that can be used to control the dispatch of kernels, in an agent independent
3632way. An agent can have zero or more AQL queues created for it using an HSA
3633compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3634are 64 bytes) can be placed. See the *HSA Platform System Architecture
3635Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3636
3637The packet processor of a kernel agent is responsible for detecting and
3638dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3639packet processor is implemented by the hardware command processor (CP),
3640asynchronous dispatch controller (ADC) and shader processor input controller
3641(SPI).
3642
3643An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3644the kernel mode driver to initialize and register the AQL queue with CP.
3645
3646To dispatch a kernel the following actions are performed. This can occur in the
3647CPU host program, or from an HSA kernel executing on a GPU.
3648
36491. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3650   executed is obtained.
36512. A pointer to the kernel descriptor (see
3652   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3653   It must be for a kernel that is contained in a code object that was loaded
3654   by an HSA compatible runtime on the kernel agent with which the AQL queue is
3655   associated.
36563. Space is allocated for the kernel arguments using the HSA compatible runtime
3657   allocator for a memory region with the kernarg property for the kernel agent
3658   that will execute the kernel. It must be at least 16-byte aligned.
36594. Kernel argument values are assigned to the kernel argument memory
3660   allocation. The layout is defined in the *HSA Programmer's Language
3661   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3662   kernel argument memory in the same way constant memory is accessed. (Note
3663   that the HSA specification allows an implementation to copy the kernel
3664   argument contents to another location that is accessed by the kernel.)
36655. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3666   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3667   for the packet. The packet must be set up, and the final write must use an
3668   atomic store release to set the packet kind to ensure the packet contents are
3669   visible to the kernel agent. AQL defines a doorbell signal mechanism to
3670   notify the kernel agent that the AQL queue has been updated. These rules, and
3671   the layout of the AQL queue and kernel dispatch packet are defined in the *HSA
3672   System Architecture Specification* [HSA]_.
36736. A kernel dispatch packet includes information about the actual dispatch,
3674   such as grid and work-group size, together with information from the code
3675   object about the kernel, such as segment sizes. The HSA compatible runtime
3676   queries on the kernel symbol can be used to obtain the code object values
3677   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
36787. CP executes micro-code and is responsible for detecting and setting up the
3679   GPU to execute the wavefronts of a kernel dispatch.
36808. CP ensures that when a wavefront starts executing the kernel machine
3681   code, the scalar general purpose registers (SGPR) and vector general purpose
3682   registers (VGPR) are set up as required by the machine code. The required
3683   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3684   register state is defined in
3685   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
36869. The prolog of the kernel machine code (see
3687   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3688   before continuing executing the machine code that corresponds to the kernel.
368910. When the kernel dispatch has completed execution, CP signals the completion
3690    signal specified in the kernel dispatch packet if not 0.
3691
3692.. _amdgpu-amdhsa-memory-spaces:
3693
3694Memory Spaces
3695~~~~~~~~~~~~~
3696
3697The memory space properties are:
3698
3699  .. table:: AMDHSA Memory Spaces
3700     :name: amdgpu-amdhsa-memory-spaces-table
3701
3702     ================= =========== ======== ======= ==================
3703     Memory Space Name HSA Segment Hardware Address NULL Value
3704                       Name        Name     Size
3705     ================= =========== ======== ======= ==================
3706     Private           private     scratch  32      0x00000000
3707     Local             group       LDS      32      0xFFFFFFFF
3708     Global            global      global   64      0x0000000000000000
3709     Constant          constant    *same as 64      0x0000000000000000
3710                                   global*
3711     Generic           flat        flat     64      0x0000000000000000
3712     Region            N/A         GDS      32      *not implemented
3713                                                    for AMDHSA*
3714     ================= =========== ======== ======= ==================
3715
3716The global and constant memory spaces both use global virtual addresses, which
3717are the same virtual address space used by the CPU. However, some virtual
3718addresses may only be accessible to the CPU, some only accessible by the GPU,
3719and some by both.
3720
3721Using the constant memory space indicates that the data will not change during
3722the execution of the kernel. This allows scalar read instructions to be
3723used. The vector and scalar L1 caches are invalidated of volatile data before
3724each kernel dispatch execution to allow constant memory to change values between
3725kernel dispatches.
3726
3727The local memory space uses the hardware Local Data Store (LDS) which is
3728automatically allocated when the hardware creates work-groups of wavefronts, and
3729freed when all the wavefronts of a work-group have terminated. The data store
3730(DS) instructions can be used to access it.
3731
3732The private memory space uses the hardware scratch memory support. If the kernel
3733uses scratch, then the hardware allocates memory that is accessed using
3734wavefront lane dword (4 byte) interleaving. The mapping used from private
3735address to physical address is:
3736
3737  ``wavefront-scratch-base +
3738  (private-address * wavefront-size * 4) +
3739  (wavefront-lane-id * 4)``
3740
3741There are different ways that the wavefront scratch base address is determined
3742by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3743memory can be accessed in an interleaved manner using buffer instructions with
3744the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3745instructions, or by flat instructions. If each lane of a wavefront accesses the
3746same private address, the interleaving results in adjacent dwords being accessed
3747and hence requires fewer cache lines to be fetched. Multi-dword access is not
3748supported except by flat and scratch instructions in GFX9-GFX10.
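
The private address mapping given above can be written out as a small helper. The function name ``scratch_physical_address`` and the example values are hypothetical. The sketch applies the formula exactly as stated and makes no claim about how the hardware implements it.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_size, wavefront_lane_id):
    """Apply the interleaved scratch mapping formula as written in the text."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


# Hypothetical values: every lane of a 64-wide wavefront accessing the
# same private address lands on adjacent dwords.
lane_addresses = [scratch_physical_address(0x1000, 1, 64, lane)
                  for lane in range(64)]
```

Because consecutive lanes differ by exactly 4 bytes, the 64 accesses in this example cover one contiguous 256-byte range.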
3749
3750The generic address space uses the hardware flat address support available in
3751GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
3752local apertures), that are outside the range of addressable global memory, to
3753map from a flat address to a private or local address.
3754
3755FLAT instructions can take a flat address and access global, private (scratch)
3756and group (LDS) memory depending on whether the address is within one of the
3757aperture ranges. Flat access to scratch requires hardware aperture setup and
3758setup in the kernel prologue (see
3759:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3760hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3761:ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3762
3763To convert between a segment address and a flat address, the base addresses of
3764the apertures can be used. For GFX7-GFX8 these are available in the
3765:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
3766the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3767GFX9-GFX10 the aperture base addresses are directly available as inline constant
3768registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3769address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32,
3770which makes it easier to convert from flat to segment or segment to flat.
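
Because the apertures are 2^32 bytes in size and 2^32 aligned in 64-bit address mode, the conversion reduces to operations on the upper and lower 32 bits. The sketch below illustrates this. The function names and the aperture base value are hypothetical. NULL pointer handling is deliberately not modeled, even though segment and flat NULL values differ.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def is_in_aperture(flat_address, aperture_base):
    """True if the flat address falls inside the aperture starting at base."""
    return (flat_address >> 32) == (aperture_base >> 32)


def segment_to_flat(segment_address, aperture_base):
    """Place a 32-bit segment address inside its aperture."""
    if aperture_base % APERTURE_SIZE:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(flat_address, aperture_base):
    """Recover the 32-bit segment address from a flat address."""
    if not is_in_aperture(flat_address, aperture_base):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)


# Hypothetical aperture base, chosen only to be 2^32 aligned.
example_base = 0x0000_1000_0000_0000
example_flat = segment_to_flat(0x1234, example_base)
```

A flat address whose upper 32 bits do not match either aperture base is treated as a global address, which is why the aperture check is a single comparison.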
3771
3772Image and Samplers
3773~~~~~~~~~~~~~~~~~~
3774
Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3777object respectively. In order to support the HSA ``query_sampler`` operations
3778two extra dwords are used to store the HSA BRIG enumeration values for the
3779queries that are not trivially deducible from the S# representation.
3780
3781HSA Signals
3782~~~~~~~~~~~
3783
3784HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3785are 64-bit addresses of a structure allocated in memory accessible from both the
3786CPU and GPU. The structure is defined by the runtime and subject to change
3787between releases. For example, see [AMD-ROCm-github]_.
3788
3789.. _amdgpu-amdhsa-hsa-aql-queue:
3790
3791HSA AQL Queue
3792~~~~~~~~~~~~~
3793
3794The HSA AQL queue structure is defined by an HSA compatible runtime (see
3795:ref:`amdgpu-os`) and subject to change between releases. For example, see
3796[AMD-ROCm-github]_. For some processors it contains fields needed to implement
3797certain language features such as the flat address aperture bases. It also
3798contains fields used by CP such as managing the allocation of scratch memory.
3799
3800.. _amdgpu-amdhsa-kernel-descriptor:
3801
3802Kernel Descriptor
3803~~~~~~~~~~~~~~~~~
3804
3805A kernel descriptor consists of the information needed by CP to initiate the
3806execution of a kernel, including the entry point address of the machine code
3807that implements the kernel.
3808
3809Code Object V3 Kernel Descriptor
3810++++++++++++++++++++++++++++++++
3811
CP microcode requires the kernel descriptor to be allocated with 64-byte
alignment.
3814
3815The fields used by CP for code objects before V3 also match those specified in
3816:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3817
3818  .. table:: Code Object V3 Kernel Descriptor
3819     :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3820
3821     ======= ======= =============================== ============================
3822     Bits    Size    Field Name                      Description
3823     ======= ======= =============================== ============================
3824     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3825                                                     address space memory
3826                                                     required for a work-group
3827                                                     in bytes. This does not
3828                                                     include any dynamically
3829                                                     allocated local address
3830                                                     space memory that may be
3831                                                     added when the kernel is
3832                                                     dispatched.
3833     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3834                                                     private address space
3835                                                     memory required for a
3836                                                     work-item in bytes.
3837                                                     Additional space may need to
3838                                                     be added to this value if
3839                                                     the call stack has
3840                                                     non-inlined function calls.
3841     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3842                                                     memory pointed to by the
3843                                                     AQL dispatch packet. The
3844                                                     kernarg memory is used to
3845                                                     pass arguments to the
3846                                                     kernel.
3847
3848                                                     * If the kernarg pointer in
3849                                                       the dispatch packet is NULL
3850                                                       then there are no kernel
3851                                                       arguments.
3852                                                     * If the kernarg pointer in
3853                                                       the dispatch packet is
3854                                                       not NULL and this value
3855                                                       is 0 then the kernarg
3856                                                       memory size is
3857                                                       unspecified.
3858                                                     * If the kernarg pointer in
3859                                                       the dispatch packet is
3860                                                       not NULL and this value
3861                                                       is not 0 then the value
3862                                                       specifies the kernarg
3863                                                       memory size in bytes. It
3864                                                       is recommended to provide
3865                                                       a value as it may be used
3866                                                       by CP to optimize making
3867                                                       the kernarg memory
3868                                                       visible to the kernel
3869                                                       code.
3870
3871     127:96  4 bytes                                 Reserved, must be 0.
3872     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3873                                                     negative) from base
3874                                                     address of kernel
3875                                                     descriptor to kernel's
3876                                                     entry point instruction
3877                                                     which must be 256 byte
3878                                                     aligned.
     351:192 20                                      Reserved, must be 0.
3880             bytes
3881     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
3882                                                       Reserved, must be 0.
3883                                                     GFX90A, GFX940
3884                                                       Compute Shader (CS)
3885                                                       program settings used by
3886                                                       CP to set up
3887                                                       ``COMPUTE_PGM_RSRC3``
3888                                                       configuration
3889                                                       register. See
3890                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3891                                                     GFX10
3892                                                       Compute Shader (CS)
3893                                                       program settings used by
3894                                                       CP to set up
3895                                                       ``COMPUTE_PGM_RSRC3``
3896                                                       configuration
3897                                                       register. See
3898                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
3899     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3900                                                     program settings used by
3901                                                     CP to set up
3902                                                     ``COMPUTE_PGM_RSRC1``
3903                                                     configuration
3904                                                     register. See
3905                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3906     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
3907                                                     program settings used by
3908                                                     CP to set up
3909                                                     ``COMPUTE_PGM_RSRC2``
3910                                                     configuration
3911                                                     register. See
3912                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
3914                                                     SGPR user data registers
3915                                                     (see
3916                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3917
                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the value
                                                     in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
3925     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
3926                     _BUFFER                         column of
3927                                                     :ref:`amdgpu-processor-table`
3928                                                     specifies *Architected flat
3929                                                     scratch* then not supported
                                                     and must be 0.
3931     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
3932     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
3933     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
3934     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
3935     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
3936                                                     column of
3937                                                     :ref:`amdgpu-processor-table`
3938                                                     specifies *Architected flat
3939                                                     scratch* then not supported
                                                     and must be 0.
3941     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
3942                     _SIZE
3943     457:455 3 bits                                  Reserved, must be 0.
3944     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
3945                                                       Reserved, must be 0.
3946                                                     GFX10
3947                                                       - If 0 execute in
3948                                                         wavefront size 64 mode.
3949                                                       - If 1 execute in
3950                                                         native wavefront size
3951                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
3953     464     1 bit   RESERVED_464                    Deprecated, must be 0.
3954     467:465 3 bits                                  Reserved, must be 0.
3955     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
3957     511:472 5 bytes                                 Reserved, must be 0.
3958     512     **Total size 64 bytes.**
3959     ======= ====================================================================
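
As an illustration of the 64-byte layout in the table above, the following
sketch packs a kernel descriptor with Python's ``struct`` module. The function
name, the constant names for the enable bits, and the example values are
assumptions made for illustration. The fields are packed little-endian.

```python
import struct

# 4 x u32, one signed 64-bit offset, 20 reserved bytes,
# 3 x u32 (RSRC3, RSRC1, RSRC2), one u16 of enable bits, 6 reserved bytes.
KD_FORMAT = "<4Iq20x3IH6x"

# Bit positions inside the 16-bit field that starts at descriptor bit 448.
ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER = 1 << 0    # bit 448
ENABLE_SGPR_DISPATCH_PTR = 1 << 1              # bit 449
ENABLE_SGPR_QUEUE_PTR = 1 << 2                 # bit 450
ENABLE_SGPR_KERNARG_SEGMENT_PTR = 1 << 3       # bit 451
ENABLE_WAVEFRONT_SIZE32 = 1 << 10              # bit 458 (GFX10 only)


def pack_kernel_descriptor(group_segment_fixed_size,
                           private_segment_fixed_size,
                           kernarg_size,
                           kernel_code_entry_byte_offset,
                           compute_pgm_rsrc3,
                           compute_pgm_rsrc1,
                           compute_pgm_rsrc2,
                           kernel_code_properties):
    """Return the 64 descriptor bytes; all reserved fields are written as 0."""
    return struct.pack(KD_FORMAT,
                       group_segment_fixed_size,
                       private_segment_fixed_size,
                       kernarg_size,
                       0,  # bits 127:96, reserved
                       kernel_code_entry_byte_offset,
                       compute_pgm_rsrc3,
                       compute_pgm_rsrc1,
                       compute_pgm_rsrc2,
                       kernel_code_properties)


# Example values are arbitrary and only exercise the layout.
kd = pack_kernel_descriptor(
    group_segment_fixed_size=1024,
    private_segment_fixed_size=0,
    kernarg_size=24,
    kernel_code_entry_byte_offset=4096,
    compute_pgm_rsrc3=0,
    compute_pgm_rsrc1=0x00A00105,
    compute_pgm_rsrc2=0,
    kernel_code_properties=ENABLE_SGPR_KERNARG_SEGMENT_PTR,
)
print(len(kd))
```

The reserved padding in the format string keeps ``COMPUTE_PGM_RSRC3`` at byte
offset 44 (bit 352) and the enable bits at byte offset 56 (bit 448).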
3960
3961..
3962
3963  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
3964     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
3965
3966     ======= ======= =============================== ===========================================================================
3967     Bits    Size    Field Name                      Description
3968     ======= ======= =============================== ===========================================================================
3969     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
3970                                                     blocks used by each work-item;
3971                                                     granularity is device
3972                                                     specific:
3973
3974                                                     GFX6-GFX9
3975                                                       - vgprs_used 0..256
3976                                                       - max(0, ceil(vgprs_used / 4) - 1)
3977                                                     GFX90A, GFX940
3978                                                       - vgprs_used 0..512
3979                                                       - vgprs_used = align(arch_vgprs, 4)
3980                                                                      + acc_vgprs
3981                                                       - max(0, ceil(vgprs_used / 8) - 1)
3982                                                     GFX10 (wavefront size 64)
                                                       - vgprs_used 1..256
3984                                                       - max(0, ceil(vgprs_used / 4) - 1)
3985                                                     GFX10 (wavefront size 32)
                                                       - vgprs_used 1..256
3987                                                       - max(0, ceil(vgprs_used / 8) - 1)
3988
3989                                                     Where vgprs_used is defined
3990                                                     as the highest VGPR number
3991                                                     explicitly referenced plus
3992                                                     one.
3993
3994                                                     Used by CP to set up
3995                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
3996
3997                                                     The
3998                                                     :ref:`amdgpu-assembler`
3999                                                     calculates this
4000                                                     automatically for the
4001                                                     selected processor from
4002                                                     values provided to the
4003                                                     `.amdhsa_kernel` directive
4004                                                     by the
4005                                                     `.amdhsa_next_free_vgpr`
4006                                                     nested directive (see
4007                                                     :ref:`amdhsa-kernel-directives-table`).
4008     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4009                                                     blocks used by a wavefront;
4010                                                     granularity is device
4011                                                     specific:
4012
4013                                                     GFX6-GFX8
4014                                                       - sgprs_used 0..112
4015                                                       - max(0, ceil(sgprs_used / 8) - 1)
4016                                                     GFX9
4017                                                       - sgprs_used 0..112
4018                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
4019                                                     GFX10
4020                                                       Reserved, must be 0.
4021                                                       (128 SGPRs always
4022                                                       allocated.)
4023
4024                                                     Where sgprs_used is
4025                                                     defined as the highest
4026                                                     SGPR number explicitly
4027                                                     referenced plus one, plus
4028                                                     a target specific number
4029                                                     of additional special
4030                                                     SGPRs for VCC,
4031                                                     FLAT_SCRATCH (GFX7+) and
4032                                                     XNACK_MASK (GFX8+), and
4033                                                     any additional
4034                                                     target specific
4035                                                     limitations. It does not
4036                                                     include the 16 SGPRs added
4037                                                     if a trap handler is
4038                                                     enabled.
4039
4040                                                     The target specific
4041                                                     limitations and special
4042                                                     SGPR layout are defined in
4043                                                     the hardware
4044                                                     documentation, which can
4045                                                     be found in the
4046                                                     :ref:`amdgpu-processors`
4047                                                     table.
4048
4049                                                     Used by CP to set up
4050                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
4051
4052                                                     The
4053                                                     :ref:`amdgpu-assembler`
4054                                                     calculates this
4055                                                     automatically for the
4056                                                     selected processor from
4057                                                     values provided to the
4058                                                     `.amdhsa_kernel` directive
4059                                                     by the
4060                                                     `.amdhsa_next_free_sgpr`
4061                                                     and `.amdhsa_reserve_*`
4062                                                     nested directives (see
4063                                                     :ref:`amdhsa-kernel-directives-table`).
4064     11:10   2 bits  PRIORITY                        Must be 0.
4065
4066                                                     Start executing wavefront
4067                                                     at the specified priority.
4068
4069                                                     CP is responsible for
4070                                                     filling in
4071                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
4072     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32-bit)
                                                     precision floating point
                                                     operations.
4078
4079                                                     Floating point rounding
4080                                                     mode values are defined in
4081                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4082
4083                                                     Used by CP to set up
4084                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4085     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) precision
                                                     floating point
                                                     operations.
4091
4092                                                     Floating point rounding
4093                                                     mode values are defined in
4094                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4095
4096                                                     Used by CP to set up
4097                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4098     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32-bit)
                                                     precision floating point
                                                     operations.
4104
4105                                                     Floating point denorm mode
4106                                                     values are defined in
4107                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4108
4109                                                     Used by CP to set up
4110                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4111     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) precision
                                                     floating point
                                                     operations.
4117
4118                                                     Floating point denorm mode
4119                                                     values are defined in
4120                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4121
4122                                                     Used by CP to set up
4123                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4124     20      1 bit   PRIV                            Must be 0.
4125
4126                                                     Start executing wavefront
4127                                                     in privilege trap handler
4128                                                     mode.
4129
4130                                                     CP is responsible for
4131                                                     filling in
4132                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
4133     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
4134                                                     with DX10 clamp mode
4135                                                     enabled. Used by the vector
4136                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
4138                                                     set, clamp NaN to zero,
4139                                                     otherwise pass NaN
4140                                                     through).
4141
4142                                                     Used by CP to set up
4143                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4144     22      1 bit   DEBUG_MODE                      Must be 0.
4145
4146                                                     Start executing wavefront
4147                                                     in single step mode.
4148
4149                                                     CP is responsible for
4150                                                     filling in
4151                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4152     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
4153                                                     with IEEE mode
4154                                                     enabled. Floating point
4155                                                     opcodes that support
4156                                                     exception flag gathering
4157                                                     will quiet and propagate
4158                                                     signaling-NaN inputs per
4159                                                     IEEE 754-2008. Min_dx10 and
4160                                                     max_dx10 become IEEE
4161                                                     754-2008 compliant due to
4162                                                     signaling-NaN propagation
4163                                                     and quieting.
4164
4165                                                     Used by CP to set up
4166                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4167     24      1 bit   BULKY                           Must be 0.
4168
4169                                                     Only one work-group allowed
4170                                                     to execute on a compute
4171                                                     unit.
4172
4173                                                     CP is responsible for
4174                                                     filling in
4175                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
4176     25      1 bit   CDBG_USER                       Must be 0.
4177
4178                                                     Flag that can be used to
4179                                                     control debugging code.
4180
4181                                                     CP is responsible for
4182                                                     filling in
4183                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4184     26      1 bit   FP16_OVFL                       GFX6-GFX8
4185                                                       Reserved, must be 0.
4186                                                     GFX9-GFX10
4187                                                       Wavefront starts execution
4188                                                       with specified fp16 overflow
4189                                                       mode.
4190
4191                                                       - If 0, fp16 overflow generates
4192                                                         +/-INF values.
4193                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
4195                                                         or divide by 0 produces a +/-INF,
4196                                                         otherwise clamps computed
4197                                                         overflow to +/-MAX_FP16 as
4198                                                         appropriate.
4199
4200                                                       Used by CP to set up
4201                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4202     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
4204                                                       Reserved, must be 0.
4205                                                     GFX10
4206                                                       - If 0 execute work-groups in
4207                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups in
                                                         WGP wavefront execution mode.
4210
4211                                                       See :ref:`amdgpu-amdhsa-memory-model`.
4212
4213                                                       Used by CP to set up
4214                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
4216                                                       Reserved, must be 0.
4217                                                     GFX10
4218                                                       Controls the behavior of the
4219                                                       s_waitcnt's vmcnt and vscnt
4220                                                       counters.
4221
4222                                                       - If 0 vmcnt reports completion
4223                                                         of load and atomic with return
4224                                                         out of order with sample
4225                                                         instructions, and the vscnt
4226                                                         reports the completion of
4227                                                         store and atomic without
4228                                                         return in order.
4229                                                       - If 1 vmcnt reports completion
4230                                                         of load, atomic with return
4231                                                         and sample instructions in
4232                                                         order, and the vscnt reports
4233                                                         the completion of store and
4234                                                         atomic without return in order.
4235
4236                                                       Used by CP to set up
4237                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4238     31      1 bit    FWD_PROGRESS                   GFX6-GFX9
4239                                                       Reserved, must be 0.
4240                                                     GFX10
4241                                                       - If 0 execute SIMD wavefronts
4242                                                         using oldest first policy.
4243                                                       - If 1 execute SIMD wavefronts to
4244                                                         ensure wavefronts will make some
4245                                                         forward progress.
4246
4247                                                       Used by CP to set up
4248                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32      **Total size 4 bytes.**
4250     ======= ===================================================================================================================
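
As a rough illustration of the last two fields above, the following Python
sketch extracts ``MEM_ORDERED`` and ``FWD_PROGRESS`` from a 32-bit
``compute_pgm_rsrc1`` value and rejects non-zero values on targets where the
bits are reserved. The function name and the ``gfx_major`` parameter are
hypothetical and are not part of LLVM.

.. code-block:: python

   MEM_ORDERED_BIT = 30   # bit position taken from the table above
   FWD_PROGRESS_BIT = 31  # bit position taken from the table above


   def decode_rsrc1_gfx10_bits(rsrc1, gfx_major):
       """Return (mem_ordered, fwd_progress) from a compute_pgm_rsrc1 value.

       gfx_major is a hypothetical parameter: 6..9 for GFX6-GFX9, 10 for GFX10.
       """
       mem_ordered = (rsrc1 >> MEM_ORDERED_BIT) & 1
       fwd_progress = (rsrc1 >> FWD_PROGRESS_BIT) & 1
       # Both bits are reserved and must be 0 on GFX6-GFX9.
       if gfx_major < 10 and (mem_ordered or fwd_progress):
           raise ValueError("bits 30 and 31 are reserved before GFX10")
       return mem_ordered, fwd_progress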
4251
4252..
4253
4254  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
4255     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
4256
4257     ======= ======= =============================== ===========================================================================
4258     Bits    Size    Field Name                      Description
4259     ======= ======= =============================== ===========================================================================
4260     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4261                                                       private segment.
4262                                                     * If the *Target Properties*
4263                                                       column of
4264                                                       :ref:`amdgpu-processor-table`
4265                                                       does not specify
4266                                                       *Architected flat
4267                                                       scratch* then enable the
4268                                                       setup of the SGPR
4269                                                       wavefront scratch offset
4270                                                       system register (see
4271                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4272                                                     * If the *Target Properties*
4273                                                       column of
4274                                                       :ref:`amdgpu-processor-table`
4275                                                       specifies *Architected
4276                                                       flat scratch* then enable
4277                                                       the setup of the
4278                                                       FLAT_SCRATCH register
4279                                                       pair (see
4280                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4281
4282                                                     Used by CP to set up
4283                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4284     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4285                                                     user data
4286                                                     registers requested. This
4287                                                     number must be greater than
4288                                                     or equal to the number of user
4289                                                     data registers enabled.
4290
4291                                                     Used by CP to set up
4292                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4293     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4294
4295                                                     This bit represents
4296                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4297                                                     which is set by the CP if
4298                                                     the runtime has installed a
4299                                                     trap handler.
4300     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4301                                                     system SGPR register for
4302                                                     the work-group id in the X
4303                                                     dimension (see
4304                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4305
4306                                                     Used by CP to set up
4307                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4308     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4309                                                     system SGPR register for
4310                                                     the work-group id in the Y
4311                                                     dimension (see
4312                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4313
4314                                                     Used by CP to set up
4315                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4316     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4317                                                     system SGPR register for
4318                                                     the work-group id in the Z
4319                                                     dimension (see
4320                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4321
4322                                                     Used by CP to set up
4323                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4324     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4325                                                     system SGPR register for
4326                                                     work-group information (see
4327                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4328
4329                                                     Used by CP to set up
4330                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4331     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4332                                                     VGPR system registers used
4333                                                     for the work-item ID.
4334                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4335                                                     defines the values.
4336
4337                                                     Used by CP to set up
4338                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4339     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4340
4341                                                     Wavefront starts execution
4342                                                     with address watch
4343                                                     exceptions enabled which
4344                                                     are generated when L1 has
4345                                                     witnessed a thread access
4346                                                     an *address of
4347                                                     interest*.
4348
4349                                                     CP is responsible for
4350                                                     filling in the address
4351                                                     watch bit in
4352                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4353                                                     according to what the
4354                                                     runtime requests.
4355     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4356
4357                                                     Wavefront starts execution
4358                                                     with memory violation
                                                     exceptions
4360                                                     enabled which are generated
4361                                                     when a memory violation has
4362                                                     occurred for this wavefront from
4363                                                     L1 or LDS
4364                                                     (write-to-read-only-memory,
4365                                                     mis-aligned atomic, LDS
4366                                                     address out of range,
4367                                                     illegal address, etc.).
4368
4369                                                     CP sets the memory
4370                                                     violation bit in
4371                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4372                                                     according to what the
4373                                                     runtime requests.
4374     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4375
4376                                                     CP uses the rounded value
4377                                                     from the dispatch packet,
4378                                                     not this value, as the
4379                                                     dispatch may contain
4380                                                     dynamically allocated group
4381                                                     segment memory. CP writes
4382                                                     directly to
4383                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4384
4385                                                     Amount of group segment
4386                                                     (LDS) to allocate for each
4387                                                     work-group. Granularity is
4388                                                     device specific:
4389
4390                                                     GFX6
4391                                                       roundup(lds-size / (64 * 4))
4392                                                     GFX7-GFX10
4393                                                       roundup(lds-size / (128 * 4))
4394
4395     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4396                     _INVALID_OPERATION              with specified exceptions
4397                                                     enabled.
4398
4399                                                     Used by CP to set up
4400                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
4401                                                     (set from bits 0..6).
4402
4403                                                     IEEE 754 FP Invalid
4404                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number
4408     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4409                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4411                     _OVERFLOW
4412     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4413                     _UNDERFLOW
4414     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4415                     _INEXACT
4416     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4417                     _ZERO                           (rcp_iflag_f32 instruction
4418                                                     only)
4419     31      1 bit                                   Reserved, must be 0.
4420     32      **Total size 4 bytes.**
4421     ======= ===================================================================================================================
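
The bit layout above can be sketched in Python as follows. This is only an
illustration: ``RSRC2_FIELDS``, ``pack_rsrc2``, ``unpack_rsrc2``,
``granulated_lds_size`` and the ``gfx_major`` parameter are hypothetical names,
not LLVM or runtime APIs. The positions and widths are transcribed from the
table, and ``granulated_lds_size`` shows the rounding that the table attributes
to CP, since the field itself must be 0 in the kernel descriptor.

.. code-block:: python

   # (least significant bit, width) of each compute_pgm_rsrc2 field.
   RSRC2_FIELDS = {
       "ENABLE_PRIVATE_SEGMENT": (0, 1),
       "USER_SGPR_COUNT": (1, 5),
       "ENABLE_TRAP_HANDLER": (6, 1),
       "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
       "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
       "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
       "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
       "ENABLE_VGPR_WORKITEM_ID": (11, 2),
       "ENABLE_EXCEPTION_ADDRESS_WATCH": (13, 1),
       "ENABLE_EXCEPTION_MEMORY": (14, 1),
       "GRANULATED_LDS_SIZE": (15, 9),
       "ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION": (24, 1),
       "ENABLE_EXCEPTION_FP_DENORMAL_SOURCE": (25, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO": (26, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW": (27, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW": (28, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_INEXACT": (29, 1),
       "ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO": (30, 1),
   }

   # Fields that the table says must be 0 in the kernel descriptor.
   MUST_BE_ZERO = {
       "ENABLE_TRAP_HANDLER",
       "ENABLE_EXCEPTION_ADDRESS_WATCH",
       "ENABLE_EXCEPTION_MEMORY",
       "GRANULATED_LDS_SIZE",
   }


   def pack_rsrc2(**fields):
       """Pack named field values into a 32-bit compute_pgm_rsrc2 word."""
       word = 0
       for name, value in fields.items():
           lsb, width = RSRC2_FIELDS[name]
           if not 0 <= value < (1 << width):
               raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
           if name in MUST_BE_ZERO and value != 0:
               raise ValueError(f"{name} must be 0 in the kernel descriptor")
           word |= value << lsb
       return word


   def unpack_rsrc2(word):
       """Split a 32-bit compute_pgm_rsrc2 word into a dict of field values."""
       return {
           name: (word >> lsb) & ((1 << width) - 1)
           for name, (lsb, width) in RSRC2_FIELDS.items()
       }


   def granulated_lds_size(lds_size_bytes, gfx_major):
       """Round an LDS byte size up to granules (64 dwords on GFX6, else 128)."""
       granule_bytes = (64 if gfx_major == 6 else 128) * 4
       return -(-lds_size_bytes // granule_bytes)  # ceiling division

For example, ``pack_rsrc2(ENABLE_PRIVATE_SEGMENT=1, USER_SGPR_COUNT=6,
ENABLE_SGPR_WORKGROUP_ID_X=1, ENABLE_VGPR_WORKITEM_ID=2)`` evaluates to
``0x108D``.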
4422
4423..
4424
4425  .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4426     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4427
4428     ======= ======= =============================== ===========================================================================
4429     Bits    Size    Field Name                      Description
4430     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10                                      Reserved, must be 0.
             bits
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15                                      Reserved, must be 0.
             bits
4443     32      **Total size 4 bytes.**
4444     ======= ===================================================================================================================
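
The ``ACCUM_OFFSET`` encoding described above (granularity 4, biased by one)
can be sketched as follows. The function name and its parameters are
hypothetical; only the field positions and the 4..256 range come from the
table.

.. code-block:: python

   TG_SPLIT_BIT = 16  # bit position taken from the table above


   def encode_rsrc3_gfx90a(accum_offset, tg_split=False):
       """Encode a compute_pgm_rsrc3 word for the GFX90A/GFX940 layout."""
       if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
           raise ValueError("accum_offset must be a multiple of 4 in 4..256")
       encoded_offset = accum_offset // 4 - 1  # 4 -> 0, 8 -> 1, ..., 256 -> 63
       return encoded_offset | (int(tg_split) << TG_SPLIT_BIT)

With this sketch, an accumulation offset of 256 encodes as 63 in bits 5:0, and
all reserved bits stay 0.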
4445
4446..
4447
4448  .. table:: compute_pgm_rsrc3 for GFX10
4449     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
4450
4451     ======= ======= =============================== ===========================================================================
4452     Bits    Size    Field Name                      Description
4453     ======= ======= =============================== ===========================================================================
4454     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
4455                                                     wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4456                                                     of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4457                                                     not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4458     31:4    28                                      Reserved, must be 0.
4459             bits
4460     32      **Total size 4 bytes.**
4461     ======= ===================================================================================================================
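
The ``SHARED_VGPR_COUNT`` constraint can be checked with a few lines of
Python. This is a sketch under one assumption: ``compute_pgm_rsrc1.vgprs`` in
the table is taken to be the granulated VGPR count field of
``compute_pgm_rsrc1``, passed here as the hypothetical parameter
``granulated_vgpr_count``.

.. code-block:: python

   def shared_vgpr_count_is_valid(granulated_vgpr_count, shared_vgpr_count,
                                  wavefront_size):
       """Check SHARED_VGPR_COUNT against the limits stated in the table."""
       if not 0 <= shared_vgpr_count <= 15:
           return False  # the field is only 4 bits wide
       if wavefront_size == 32:
           return shared_vgpr_count == 0  # must be 0 for wavefront size 32
       # Wavefront size 64: private plus shared VGPRs must not exceed 256.
       total = (granulated_vgpr_count + 1) * 4 + shared_vgpr_count * 8
       return total <= 256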
4462
4463..
4464
4465  .. table:: Floating Point Rounding Mode Enumeration Values
4466     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4467
4468     ====================================== ===== ==============================
4469     Enumeration Name                       Value Description
4470     ====================================== ===== ==============================
4471     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4472     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4473     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4474     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4475     ====================================== ===== ==============================
4476
4477..
4478
4479  .. table:: Floating Point Denorm Mode Enumeration Values
4480     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4481
4482     ====================================== ===== ==============================
4483     Enumeration Name                       Value Description
4484     ====================================== ===== ==============================
4485     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
4486                                                  Denorms
4487     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4488     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4489     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4490     ====================================== ===== ==============================
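
For tools that decode these fields, the two floating point enumerations above
could be mirrored as Python ``IntEnum`` classes. The class and helper names
are hypothetical; the member values are copied from the tables.

.. code-block:: python

   from enum import IntEnum


   class FloatRoundMode(IntEnum):
       NEAR_EVEN = 0
       PLUS_INFINITY = 1
       MINUS_INFINITY = 2
       ZERO = 3


   class FloatDenormMode(IntEnum):
       FLUSH_SRC_DST = 0
       FLUSH_DST = 1
       FLUSH_SRC = 2
       FLUSH_NONE = 3


   def flushes_source(mode):
       """True if source denorms are flushed in this mode."""
       return mode in (FloatDenormMode.FLUSH_SRC_DST, FloatDenormMode.FLUSH_SRC)


   def flushes_destination(mode):
       """True if destination (output) denorms are flushed in this mode."""
       return mode in (FloatDenormMode.FLUSH_SRC_DST, FloatDenormMode.FLUSH_DST)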
4491
4492..
4493
4494  .. table:: System VGPR Work-Item ID Enumeration Values
4495     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4496
4497     ======================================== ===== ============================
4498     Enumeration Name                         Value Description
4499     ======================================== ===== ============================
4500     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4501                                                    ID.
4502     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4503                                                    dimensions ID.
4504     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4505                                                    dimensions ID.
4506     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4507     ======================================== ===== ============================
4508
4509.. _amdgpu-amdhsa-initial-kernel-execution-state:
4510
4511Initial Kernel Execution State
4512~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4513
4514This section defines the register state that will be set up by the packet
4515processor prior to the start of execution of every wavefront. This is limited by
4516the constraints of the hardware controllers of CP/ADC/SPI.
4517
The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.
4524
The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.
4531
4532SGPR register initial state is defined in
4533:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4534
4535  .. table:: SGPR Register Set Up Order
4536     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4537
4538     ========== ========================== ====== ==============================
4539     SGPR Order Name                       Number Description
4540                (kernel descriptor enable  of
4541                field)                     SGPRs
4542     ========== ========================== ====== ==============================
4543     First      Private Segment Buffer     4      See
4544                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4545                _segment_buffer)
4546     then       Dispatch Ptr               2      64-bit address of AQL dispatch
4547                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4548                                                  actually executing.
4549     then       Queue Ptr                  2      64-bit address of amd_queue_t
4550                (enable_sgpr_queue_ptr)           object for AQL queue on which
4551                                                  the dispatch packet was
4552                                                  queued.
4553     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4554                (enable_sgpr_kernarg              segment. This is directly
4555                _segment_ptr)                     copied from the
4556                                                  kernarg_address in the kernel
4557                                                  dispatch packet.
4558
4559                                                  Having CP load it once avoids
4560                                                  loading it at the beginning of
4561                                                  every wavefront.
4562     then       Dispatch Id                2      64-bit Dispatch ID of the
4563                (enable_sgpr_dispatch_id)         dispatch packet being
4564                                                  executed.
4565     then       Flat Scratch Init          2      See
4566                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4567                _init)
4568     then       Private Segment Size       1      The 32-bit byte size of a
4569                (enable_sgpr_private              single work-item's memory
4570                _segment_size)                    allocation. This is the
4571                                                  value from the kernel
4572                                                  dispatch packet Private
4573                                                  Segment Byte Size rounded up
4574                                                  by CP to a multiple of
4575                                                  DWORD.
4576
4577                                                  Having CP load it once avoids
4578                                                  loading it at the beginning of
4579                                                  every wavefront.
4580
4581                                                  This is not used for
4582                                                  GFX7-GFX8 since it is the same
4583                                                  value as the second SGPR of
4584                                                  Flat Scratch Init. However, it
4585                                                  may be needed for GFX9-GFX10 which
4586                                                  changes the meaning of the
4587                                                  Flat Scratch Init value.
4588     then       Work-Group Id X            1      32-bit work-group id in X
4589                (enable_sgpr_workgroup_id         dimension of grid for
4590                _X)                               wavefront.
4591     then       Work-Group Id Y            1      32-bit work-group id in Y
4592                (enable_sgpr_workgroup_id         dimension of grid for
4593                _Y)                               wavefront.
4594     then       Work-Group Id Z            1      32-bit work-group id in Z
4595                (enable_sgpr_workgroup_id         dimension of grid for
4596                _Z)                               wavefront.
4597     then       Work-Group Info            1      {first_wavefront, 14'b0000,
4598                (enable_sgpr_workgroup            ordered_append_term[10:0],
4599                _info)                            threadgroup_size_in_wavefronts[5:0]}
4600     then       Scratch Wavefront Offset   1      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`
4602                _segment_wavefront_offset)        and
4603                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4604     ========== ========================== ====== ==============================
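
The dense numbering described above can be sketched in Python as follows. The
order and SGPR counts are transcribed from the table; ``SGPR_SETUP_ORDER`` and
``assign_sgprs`` are hypothetical names, and the sketch does not model the
limit of 16 User SGPRs.

.. code-block:: python

   # (kernel descriptor enable field, number of SGPRs) in set-up order.
   SGPR_SETUP_ORDER = [
       ("enable_sgpr_private_segment_buffer", 4),
       ("enable_sgpr_dispatch_ptr", 2),
       ("enable_sgpr_queue_ptr", 2),
       ("enable_sgpr_kernarg_segment_ptr", 2),
       ("enable_sgpr_dispatch_id", 2),
       ("enable_sgpr_flat_scratch_init", 2),
       ("enable_sgpr_private_segment_size", 1),
       ("enable_sgpr_workgroup_id_X", 1),
       ("enable_sgpr_workgroup_id_Y", 1),
       ("enable_sgpr_workgroup_id_Z", 1),
       ("enable_sgpr_workgroup_info", 1),
       ("enable_sgpr_private_segment_wavefront_offset", 1),
   ]


   def assign_sgprs(enabled):
       """Map each enabled field to the (first, last) SGPR numbers it occupies.

       Disabled fields get no SGPR number, so enabled fields are packed densely
       starting at SGPR0.
       """
       layout = {}
       next_sgpr = 0
       for field, count in SGPR_SETUP_ORDER:
           if field in enabled:
               layout[field] = (next_sgpr, next_sgpr + count - 1)
               next_sgpr += count
       return layout

For example, enabling only the private segment buffer, the kernarg segment
pointer and the work-group id X places them in SGPR0-SGPR3, SGPR4-SGPR5 and
SGPR6 respectively.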
4605
The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.
4612
4613There are different methods used for the VGPR initial state:
4614
4615* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4616  specifies otherwise, a separate VGPR register is used per work-item ID. The
4617  VGPR register initial state for this method is defined in
4618  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
  for all work-item IDs. The register layout for this method is defined in
  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4623
4624  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4625     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4626
4627     ========== ========================== ====== ==============================
4628     VGPR Order Name                       Number Description
4629                (kernel descriptor enable  of
4630                field)                     VGPRs
4631     ========== ========================== ====== ==============================
4632     First      Work-Item Id X             1      32-bit work-item id in X
4633                (Always initialized)              dimension of work-group for
4634                                                  wavefront lane.
4635     then       Work-Item Id Y             1      32-bit work-item id in Y
4636                (enable_vgpr_workitem_id          dimension of work-group for
4637                > 0)                              wavefront lane.
4638     then       Work-Item Id Z             1      32-bit work-item id in Z
4639                (enable_vgpr_workitem_id          dimension of work-group for
4640                > 1)                              wavefront lane.
4641     ========== ========================== ====== ==============================
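
For the unpacked method, the mapping from ``enable_vgpr_workitem_id`` to
initialized VGPRs can be sketched as below. The function name and the returned
dictionary keys are hypothetical; the value 3 is rejected because
:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` lists it
as undefined.

.. code-block:: python

   def unpacked_workitem_id_vgprs(enable_vgpr_workitem_id):
       """Return which VGPR holds each initialized work-item id dimension."""
       if enable_vgpr_workitem_id not in (0, 1, 2):
           raise ValueError("enable_vgpr_workitem_id must be 0, 1 or 2")
       dimensions = ["work_item_id_x", "work_item_id_y", "work_item_id_z"]
       # X is always initialized; Y needs a value > 0 and Z a value > 1.
       enabled = dimensions[: enable_vgpr_workitem_id + 1]
       return {name: f"v{number}" for number, name in enumerate(enabled)}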
4642
4643..
4644
4645  .. table:: Register Layout for Packed Work-Item ID Method
4646     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4647
4648     ======= ======= ================ =========================================
4649     Bits    Size    Field Name       Description
4650     ======= ======= ================ =========================================
     9:0     10 bits Work-Item Id X   Work-item id in X
                                      dimension of work-group for
                                      wavefront lane.

                                      Always initialized.

     19:10   10 bits Work-Item Id Y   Work-item id in Y
                                      dimension of work-group for
                                      wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      0, otherwise set to 0.
     29:20   10 bits Work-Item Id Z   Work-item id in Z
                                      dimension of work-group for
                                      wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      1, otherwise set to 0.
     31:30   2 bits                   Reserved, set to 0.
4670     ======= ======= ================ =========================================
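
For the packed method, the three 10-bit fields of VGPR0 can be extracted with
shifts and masks. The following Python sketch uses hypothetical function names
and models VGPR0 as a plain 32-bit integer.

.. code-block:: python

   WORKITEM_ID_MASK = 0x3FF  # each work-item id field is 10 bits wide


   def pack_workitem_ids(x, y=0, z=0):
       """Build the packed VGPR0 value; the top two bits stay reserved as 0."""
       for value in (x, y, z):
           if not 0 <= value <= WORKITEM_ID_MASK:
               raise ValueError("each work-item id must fit in 10 bits")
       return x | (y << 10) | (z << 20)


   def unpack_workitem_ids(vgpr0):
       """Return the (x, y, z) work-item ids held in a packed VGPR0 value."""
       return (
           vgpr0 & WORKITEM_ID_MASK,
           (vgpr0 >> 10) & WORKITEM_ID_MASK,
           (vgpr0 >> 20) & WORKITEM_ID_MASK,
       )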
4671
4672The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4673
46741. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4675   registers.
46762. Work-group Id registers X, Y, Z are set by ADC which supports any
4677   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
   its value cannot be included with the flat scratch init value, which is per
   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
46814. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4682   or (X, Y, Z).
46835. Flat Scratch register pair initialization is described in
4684   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4685
4686The global segment can be accessed either using buffer instructions (GFX6 which
4687has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
4688instructions (GFX9-GFX10).
4689
4690If buffer operations are used, then the compiler can generate a V# with the
4691following properties:
4692
4693* base address of 0
4694* no swizzle
4695* ATC: 1 if IOMMU present (such as APU)
4696* ptr64: 1
4697* MTYPE set to support memory coherence that matches the runtime (such as CC for
4698  APU and NC for dGPU).
4699
4700.. _amdgpu-amdhsa-kernel-prolog:
4701
4702Kernel Prolog
4703~~~~~~~~~~~~~
4704
4705The compiler performs initialization in the kernel prologue depending on the
4706target and information about things like stack usage in the kernel and called
4707functions. Some of this initialization requires the compiler to request certain
4708User and System SGPRs be present in the
4709:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4710:ref:`amdgpu-amdhsa-kernel-descriptor`.
4711
4712.. _amdgpu-amdhsa-kernel-prolog-cfi:
4713
4714CFI
4715+++
4716
47171.  The CFI return address is undefined.
4718
47192.  The CFI CFA is defined using an expression which evaluates to a location
4720    description that comprises one memory location description for the
4721    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4722
4723.. _amdgpu-amdhsa-kernel-prolog-m0:
4724
4725M0
4726++
4727
4728GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF
  for GFX7-GFX8).
4734GFX9-GFX10
4735  The M0 register is not used for range checking LDS accesses and so does not
4736  need to be initialized in the prolog.
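
The per-target rule above can be summarized by a small Python sketch. The
function name and the ``gfx_major`` parameter are hypothetical; the constants
are the maximum LDS values quoted above.

.. code-block:: python

   def m0_max_lds_value(gfx_major):
       """Return a safe M0 initial value, or None if M0 needs no set-up."""
       if gfx_major == 6:
           return 0x7FFF
       if gfx_major in (7, 8):
           return 0xFFFF
       if gfx_major in (9, 10):
           return None  # M0 is not used for LDS range checking
       raise ValueError("only GFX6-GFX10 are covered by this sketch")

A prolog generator could use the returned value when the exact total LDS size
from the dispatch packet is not convenient to load.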
4737
4738.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4739
4740Stack Pointer
4741+++++++++++++
4742
4743If the kernel has function calls it must set up the ABI stack pointer described
4744in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4745SGPR32 to the unswizzled scratch offset of the address past the last local
4746allocation.
4747
4748.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4749
4750Frame Pointer
4751+++++++++++++
4752
4753If the kernel needs a frame pointer for the reasons defined in
4754``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4755kernel prolog. If a frame pointer is not required then all uses of the frame
4756pointer are replaced with immediate ``0`` offsets.
4757
4758.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4759
4760Flat Scratch
4761++++++++++++
4762
4763There are different methods used for initializing flat scratch:
4764
4765* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4766  specifies *Does not support generic address space*:
4767
4768  Flat scratch is not supported and there is no flat scratch register pair.
4769
4770* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4771  specifies *Offset flat scratch*:
4772
4773  If the kernel or any function it calls may use flat operations to access
4774  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4775  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4776  Scratch Wavefront Offset SGPR registers (see
4777  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4778
4779  1. The low word of Flat Scratch Init is the 32-bit byte offset from
4780     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4781     being managed by SPI for the queue executing the kernel dispatch. This is
4782     the same value used in the Scratch Segment Buffer V# base address.
4783
4784     CP obtains this from the runtime. (The Scratch Segment Buffer base address
4785     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4786
4787     The prolog must add the value of Scratch Wavefront Offset to get the
4788     wavefront's byte scratch backing memory offset from
4789     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4790
4791     The Scratch Wavefront Offset must also be used as an offset with Private
4792     segment address when using the Scratch Segment Buffer.
4793
     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.
4796
4797     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4798     SGPRn is the highest numbered SGPR allocated to the wavefront).
4799     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4800     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4801     FLAT SCRATCH BASE in flat memory instructions that access the scratch
4802     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
4805
4806     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4807     checks that the value in the kernel dispatch packet Private Segment Byte
4808     Size is not larger and requests the runtime to increase the queue's scratch
4809     size if necessary.
4810
4811     CP directly loads from the kernel dispatch packet Private Segment Byte Size
4812     field and rounds up to a multiple of DWORD. Having CP load it once avoids
4813     loading it at the beginning of every wavefront.
4814
4815     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4816     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4817     in flat memory instructions.
4818
4819* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4820  specifies *Absolute flat scratch*:
4821
4822  If the kernel or any function it calls may use flat operations to access
4823  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4824  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4825  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4826  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4827
4828  The Flat Scratch Init is the 64-bit address of the base of scratch backing
4829  memory being managed by SPI for the queue executing the kernel dispatch.
4830
4831  CP obtains this from the runtime.
4832
4833  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4834  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4835  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4836  memory instructions.
4837
4838  The Scratch Wavefront Offset must also be used as an offset with Private
4839  segment address when using the Scratch Segment Buffer (see
4840  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4841
4842* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4843  specifies *Architected flat scratch*:
4844
4845  If ENABLE_PRIVATE_SEGMENT is enabled in
4846  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
4847  register pair will be initialized to the 64-bit address of the base of scratch
4848  backing memory being managed by SPI for the queue executing the kernel
4849  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4850  flat scratch base in flat memory instructions.
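The arithmetic for the *Offset flat scratch* and *Absolute flat scratch* cases
can be sketched as follows. This is a model of the calculation described above,
not the code the backend emits. The function and parameter names are
hypothetical, and placing the low 32 bits of the 64-bit address in
FLAT_SCRATCH_LO is an assumption of this sketch.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo, init_hi, wave_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for offset flat scratch.

    init_lo: low word of Flat Scratch Init (byte offset of the queue's
             scratch backing memory from SH_HIDDEN_PRIVATE_BASE_VIMID)
    init_hi: second word of Flat Scratch Init (per work-item byte size)
    wave_offset: Scratch Wavefront Offset in bytes
    """
    wave_base = (init_lo + wave_offset) & MASK32
    flat_scratch_hi = wave_base >> 8  # stored in units of 256 bytes
    flat_scratch_lo = init_hi         # used as FLAT SCRATCH SIZE
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(init_addr, wave_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for absolute flat scratch.

    init_addr: 64-bit Flat Scratch Init address
    Assumption: the low 32 bits of the sum go in FLAT_SCRATCH_LO.
    """
    base = (init_addr + wave_offset) & MASK64
    return base & MASK32, base >> 32


print(offset_flat_scratch(0x10000, 64, 0x2000))
print(absolute_flat_scratch(0x1FFFFF000, 0x2000))
```

The 64-bit case shows why the addition must carry into the high word: the sum
of the low words overflows 32 bits in the second call.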
4851
4852.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4853
4854Private Segment Buffer
4855++++++++++++++++++++++
4856
4857If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4858*Architected flat scratch* then a Private Segment Buffer is not supported.
4859Instead the flat SCRATCH instructions are used.
4860
Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by the
4863runtime. It is used, together with Scratch Wavefront Offset as an offset, to
4864access the private memory space using a segment address. See
4865:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4866
The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
always selected for the kernel as follows:
4869
4870  - If it is known during instruction selection that there is stack usage,
4871    SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
4872    optimizations are disabled (``-O0``), if stack objects already exist (for
4873    locals, etc.), or if there are any function calls.
4874
4875  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4876    are reserved for the tentative scratch V#. These will be used if it is
4877    determined that spilling is needed.
4878
4879    - If no use is made of the tentative scratch V#, then it is unreserved,
4880      and the register count is determined ignoring it.
4881    - If use is made of the tentative scratch V#, then its register numbers
4882      are shifted to the first four-aligned SGPR index after the highest one
4883      allocated by the register allocator, and all uses are updated. The
4884      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.
4888
4889    .. note::
4890
4891      This approach of using a tentative scratch V# and shifting the register
4892      numbers if used avoids having to perform register allocation a second
4893      time if the tentative V# is eliminated. This is more efficient and
4894      avoids the problem that the second register allocation may perform
4895      spilling which will fail as there is no longer a scratch V#.
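The selection of the final position of the tentative scratch V# can be sketched
as below. This is a simplified model of the three cases in the list above, not
the backend implementation, and every name in it is hypothetical.

```python
def align_up(index, alignment=4):
    """Round index up to the next multiple of alignment."""
    return (index + alignment - 1) // alignment * alignment


def final_scratch_vsharp(highest_allocated_sgpr, tentative_used,
                         has_sgpr_bug, tentative_base):
    """Return the first SGPR index of the scratch V#, or None if unreserved."""
    if has_sgpr_bug:
        # The tentative allocation is neither shifted nor unreserved.
        return tentative_base
    if not tentative_used:
        # Unreserved: the register count ignores the tentative V#.
        return None
    # First four-aligned index after the highest allocated SGPR.
    return align_up(highest_allocated_sgpr + 1)


print(final_scratch_vsharp(9, True, False, 96))
print(final_scratch_vsharp(11, True, False, 96))
print(final_scratch_vsharp(9, False, False, 96))
print(final_scratch_vsharp(9, False, True, 96))
```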
4896
4897When the kernel prolog code is being emitted it is known whether the scratch V#
4898described above is actually used. If it is, the prolog code must set it up by
4899copying the Private Segment Buffer to the scratch V# registers and then adding
4900the Private Segment Wavefront Offset to the queue base address in the V#. The
4901result is a V# with a base address pointing to the beginning of the wavefront
4902scratch backing memory.
4903
4904The Private Segment Buffer is always requested, but the Private Segment
4905Wavefront Offset is only requested if it is used (see
4906:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4907
4908.. _amdgpu-amdhsa-memory-model:
4909
4910Memory Model
4911~~~~~~~~~~~~
4912
4913This section describes the mapping of the LLVM memory model onto AMDGPU machine
4914code (see :ref:`memmodel`).
4915
4916The AMDGPU backend supports the memory synchronization scopes specified in
4917:ref:`amdgpu-memory-scopes`.
4918
4919The code sequences used to implement the memory model specify the order of
4920instructions that a single thread must execute. The ``s_waitcnt`` and cache
4921management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
4922to other memory instructions executed by the same thread. This allows them to be
4923moved earlier or later which can allow them to be combined with other instances
4924of the same instruction, or hoisted/sunk out of loops to improve performance.
4925Only the instructions related to the memory model are given; additional
4926``s_waitcnt`` instructions are required to ensure registers are defined before
4927being used. These may be able to be combined with the memory model ``s_waitcnt``
4928instructions as described above.
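Combining two ``s_waitcnt`` instructions can be modelled as taking the stricter
threshold for each counter. The sketch below assumes a wait is represented as a
dictionary from counter name to the maximum number of outstanding operations
allowed. That representation and the name ``combine_waitcnt`` are hypothetical
and only serve to illustrate the idea.

```python
def combine_waitcnt(first, second):
    """Merge two waits: each counter keeps the smaller (stricter) threshold."""
    merged = dict(first)
    for counter, limit in second.items():
        merged[counter] = min(limit, merged.get(counter, limit))
    return merged


# A memory model wait merged with a wait that protects a register use.
print(combine_waitcnt({"vmcnt": 0}, {"lgkmcnt": 0}))
print(combine_waitcnt({"vmcnt": 2}, {"vmcnt": 0, "lgkmcnt": 1}))
```

A counter that appears in only one of the two waits keeps its threshold
unchanged, so the merged wait is never weaker than either input.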
4929
4930The AMDGPU backend supports the following memory models:
4931
4932  HSA Memory Model [HSA]_
4933    The HSA memory model uses a single happens-before relation for all address
4934    spaces (see :ref:`amdgpu-address-spaces`).
4935  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).
4947
4948``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
4949operations.
4950
4951``buffer/global/flat_load/store/atomic`` instructions to global memory are
4952termed vector memory operations.
4953
4954Private address space uses ``buffer_load/store`` using the scratch V#
4955(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
4956is accessing the memory, atomic memory orderings are not meaningful, and all
4957accesses are treated as non-atomic.
4958
4959Constant address space uses ``buffer/global_load`` instructions (or equivalent
4960scalar memory instructions). Since the constant address space contents do not
4961change during the execution of a kernel dispatch it is not legal to perform
4962stores, and atomic memory orderings are not meaningful, and all accesses are
4963treated as non-atomic.
4964
4965A memory synchronization scope wider than work-group is not meaningful for the
4966group (LDS) address space and is treated as work-group.
4967
4968The memory model does not support the region address space which is treated as
4969non-atomic.
4970
4971Acquire memory ordering is not meaningful on store atomic instructions and is
4972treated as non-atomic.
4973
Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
4976
4977Acquire-release memory ordering is not meaningful on load or store atomic
4978instructions and is treated as acquire and release respectively.
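The normalization rules in the preceding paragraphs can be collected into two
small helpers. This is a sketch of the rules as stated in this section, not
code from the backend. The function names and the string spellings of
instructions, orderings, scopes and address spaces are hypothetical.

```python
WIDER_THAN_WORKGROUP = {"agent", "system"}
NON_ATOMIC_SPACES = {"private", "constant", "region"}


def effective_ordering(instr, ordering, address_space):
    """Return the ordering actually implemented for an atomic instruction."""
    if address_space in NON_ATOMIC_SPACES:
        # All accesses to these address spaces are treated as non-atomic.
        return "non-atomic"
    if instr == "store atomic" and ordering == "acquire":
        return "non-atomic"
    if instr == "load atomic" and ordering == "release":
        return "non-atomic"
    if ordering == "acq_rel":
        if instr == "load atomic":
            return "acquire"
        if instr == "store atomic":
            return "release"
    return ordering


def effective_scope(scope, address_space):
    """Return the synchronization scope actually implemented."""
    if address_space == "local" and scope in WIDER_THAN_WORKGROUP:
        # Wider than work-group is not meaningful for the LDS address space.
        return "workgroup"
    return scope


print(effective_ordering("load atomic", "acq_rel", "global"))
print(effective_ordering("store atomic", "acquire", "global"))
print(effective_scope("system", "local"))
```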
4979
4980The memory order also adds the single thread optimization constraints defined in
4981table
4982:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
4983
4984  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
4985     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
4986
4987     ============ ==============================================================
4988     LLVM Memory  Optimization Constraints
4989     Ordering
4990     ============ ==============================================================
4991     unordered    *none*
4992     monotonic    *none*
4993     acquire      - If a load atomic/atomicrmw then no following load/load
4994                    atomic/store/store atomic/atomicrmw/fence instruction can be
4995                    moved before the acquire.
4996                  - If a fence then same as load atomic, plus no preceding
4997                    associated fence-paired-atomic can be moved after the fence.
4998     release      - If a store atomic/atomicrmw then no preceding load/load
4999                    atomic/store/store atomic/atomicrmw/fence instruction can be
5000                    moved after the release.
5001                  - If a fence then same as store atomic, plus no following
5002                    associated fence-paired-atomic can be moved before the
5003                    fence.
5004     acq_rel      Same constraints as both acquire and release.
5005     seq_cst      - If a load atomic then same constraints as acquire, plus no
5006                    preceding sequentially consistent load atomic/store
5007                    atomic/atomicrmw/fence instruction can be moved after the
5008                    seq_cst.
5009                  - If a store atomic then the same constraints as release, plus
5010                    no following sequentially consistent load atomic/store
5011                    atomic/atomicrmw/fence instruction can be moved before the
5012                    seq_cst.
5013                  - If an atomicrmw/fence then same constraints as acq_rel.
5014     ============ ==============================================================
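As a rough reading aid for the table above, the constraints on ordinary memory
instructions can be modelled as a pair of flags. The sketch is deliberately
simplified: it does not model the additional rules for fence-paired atomics or
for ordering between two sequentially consistent instructions, and the function
name and string spellings are hypothetical.

```python
def motion_constraints(ordering, kind):
    """Return (blocks_hoist, blocks_sink) for one atomic instruction or fence.

    blocks_hoist: following memory instructions may not move before it.
    blocks_sink:  preceding memory instructions may not move after it.
    kind is one of "load atomic", "store atomic", "atomicrmw", "fence".
    """
    if ordering in ("unordered", "monotonic"):
        return (False, False)
    if ordering == "seq_cst":
        # seq_cst reduces to acquire, release or acq_rel for this purpose.
        ordering = {"load atomic": "acquire",
                    "store atomic": "release"}.get(kind, "acq_rel")
    blocks_hoist = ordering in ("acquire", "acq_rel") and kind != "store atomic"
    blocks_sink = ordering in ("release", "acq_rel") and kind != "load atomic"
    return (blocks_hoist, blocks_sink)


print(motion_constraints("acquire", "load atomic"))
print(motion_constraints("release", "store atomic"))
print(motion_constraints("seq_cst", "fence"))
```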
5015
5016The code sequences used to implement the memory model are defined in the
5017following sections:
5018
5019* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5020* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5021* :ref:`amdgpu-amdhsa-memory-model-gfx940`
5022* :ref:`amdgpu-amdhsa-memory-model-gfx10`
5023
5024.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5025
5026Memory Model GFX6-GFX9
5027++++++++++++++++++++++
5028
5029For GFX6-GFX9:
5030
5031* Each agent has multiple shader arrays (SA).
5032* Each SA has multiple compute units (CU).
5033* Each CU has multiple SIMDs that execute wavefronts.
5034* The wavefronts for a single work-group are executed in the same CU but may be
5035  executed by different SIMDs.
5036* Each CU has a single LDS memory shared by the wavefronts of the work-groups
5037  executing on it.
5038* All LDS operations of a CU are performed as wavefront wide operations in a
5039  global order and involve no caching. Completion is reported to a wavefront in
5040  execution order.
5041* The LDS memory has multiple request queues shared by the SIMDs of a
5042  CU. Therefore, the LDS operations performed by different wavefronts of a
5043  work-group can be reordered relative to each other, which can result in
5044  reordering the visibility of vector memory operations with respect to LDS
5045  operations of other wavefronts in the same work-group. A ``s_waitcnt
5046  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5047  vector memory operations between wavefronts of a work-group, but not between
5048  operations performed by the same wavefront.
5049* The vector memory operations are performed as wavefront wide operations and
5050  completion is reported to a wavefront in execution order. The exception is
5051  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5052  vector memory order if they access LDS memory, and out of LDS operation order
5053  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between the
5056  lanes of a single wavefront, or for coherence between wavefronts in the same
5057  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
5058  wavefronts executing in different work-groups as they may be executing on
5059  different CUs.
5060* The scalar memory operations access a scalar L1 cache shared by all wavefronts
5061  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5062  scalar operations are used in a restricted way so do not impact the memory
5063  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5064* The vector and scalar memory operations use an L2 cache shared by all CUs on
5065  the same agent.
5066* The L2 cache has independent channels to service disjoint ranges of virtual
5067  addresses.
5068* Each CU has a separate request queue per channel. Therefore, the vector and
5069  scalar memory operations performed by wavefronts executing in different
5070  work-groups (which may be executing on different CUs) of an agent can be
5071  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5072  ensure synchronization between vector memory operations of different CUs. It
5073  ensures a previous vector memory operation has completed before executing a
5074  subsequent vector memory or LDS operation and so can be used to meet the
5075  requirements of acquire and release.
5076* The L2 cache can be kept coherent with other agents on some targets, or ranges
5077  of virtual addresses can be set up to bypass it to ensure system coherence.
5078
5079Scalar memory operations are only used to access memory that is proven to not
5080change during the execution of the kernel dispatch. This includes constant
5081address space and global address space for program scope ``const`` variables.
5082Therefore, the kernel machine code does not have to maintain the scalar cache to
5083ensure it is coherent with the vector caches. The scalar and vector caches are
5084invalidated between kernel dispatches by CP since constant address space data
5085may change between kernel dispatch executions. See
5086:ref:`amdgpu-amdhsa-memory-spaces`.
5087
5088The one exception is if scalar writes are used to spill SGPR registers. In this
5089case the AMDGPU backend ensures the memory location used to spill is never
5090accessed by vector memory operations at the same time. If scalar writes are used
5091then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5092return since the locations may be used for vector memory instructions by a
5093future wavefront that uses the same scratch area, or a function call that
5094creates a frame at the same address, respectively. There is no need for a
5095``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5096
5097For kernarg backing memory:
5098
5099* CP invalidates the L1 cache at the start of each kernel dispatch.
5100* On dGPU the kernarg backing memory is allocated in host memory accessed as
5101  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5102  causes it to be treated as non-volatile and so is not invalidated by
5103  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
5106
5107Scratch backing memory (which is used for the private address space) is accessed
5108with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5109only accessed by a single thread, and is always write-before-read, there is
5110never a need to invalidate these entries from the L1 cache. Hence all cache
5111invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5112
5113The code sequences used to implement the memory model for GFX6-GFX9 are defined
5114in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
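As a reading aid, the ``load atomic`` acquire rows of that table can be
transcribed into a lookup function. This is only a restatement of those rows
for illustration. The function name, the string spellings and the ``opencl``
flag are hypothetical, and the instruction strings are copied from the table
rather than being valid assembly.

```python
def load_atomic_acquire_gfx6_gfx9(scope, address_space, opencl=False):
    """Return the GFX6-GFX9 code sequence for a load atomic acquire."""
    if address_space == "local" and scope in ("agent", "system"):
        # Scopes wider than work-group are treated as work-group for LDS.
        scope = "workgroup"
    if scope in ("singlethread", "wavefront"):
        return ["buffer/global/ds/flat_load"]
    if scope == "workgroup":
        if address_space == "global":
            return ["buffer/global_load"]
        sequence = ["ds/flat_load"]
        if not opencl:
            # The wait is omitted for OpenCL.
            sequence.append("s_waitcnt lgkmcnt(0)")
        return sequence
    if address_space == "global":
        return ["buffer/global_load glc=1",
                "s_waitcnt vmcnt(0)",
                "buffer_wbinvl1_vol"]
    if address_space == "generic":
        # For OpenCL only the lgkmcnt(0) part of the wait is omitted.
        wait = "s_waitcnt vmcnt(0)" if opencl else "s_waitcnt vmcnt(0) & lgkmcnt(0)"
        return ["flat_load glc=1", wait, "buffer_wbinvl1_vol"]
    raise ValueError("unsupported scope/address space combination")


for step in load_atomic_acquire_gfx6_gfx9("agent", "global"):
    print(step)
```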
5115
5116  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5117     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5118
5119     ============ ============ ============== ========== ================================
5120     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
5121                  Ordering     Sync Scope     Address    GFX6-GFX9
5122                                              Space
5123     ============ ============ ============== ========== ================================
5124     **Non-Atomic**
5125     ------------------------------------------------------------------------------------
5126     load         *none*       *none*         - global   - !volatile & !nontemporal
5127                                              - generic
5128                                              - private    1. buffer/global/flat_load
5129                                              - constant
5130                                                         - !volatile & nontemporal
5131
5132                                                           1. buffer/global/flat_load
5133                                                              glc=1 slc=1
5134
5135                                                         - volatile
5136
5137                                                           1. buffer/global/flat_load
5138                                                              glc=1
5139                                                           2. s_waitcnt vmcnt(0)
5140
5141                                                            - Must happen before
5142                                                              any following volatile
5143                                                              global/generic
5144                                                              load/store.
5145                                                            - Ensures that
5146                                                              volatile
5147                                                              operations to
5148                                                              different
5149                                                              addresses will not
5150                                                              be reordered by
5151                                                              hardware.
5152
5153     load         *none*       *none*         - local    1. ds_load
5154     store        *none*       *none*         - global   - !volatile & !nontemporal
5155                                              - generic
5156                                              - private    1. buffer/global/flat_store
5157                                              - constant
5158                                                         - !volatile & nontemporal
5159
5160                                                           1. buffer/global/flat_store
5161                                                              glc=1 slc=1
5162
5163                                                         - volatile
5164
5165                                                           1. buffer/global/flat_store
5166                                                           2. s_waitcnt vmcnt(0)
5167
5168                                                            - Must happen before
5169                                                              any following volatile
5170                                                              global/generic
5171                                                              load/store.
5172                                                            - Ensures that
5173                                                              volatile
5174                                                              operations to
5175                                                              different
5176                                                              addresses will not
5177                                                              be reordered by
5178                                                              hardware.
5179
5180     store        *none*       *none*         - local    1. ds_store
5181     **Unordered Atomic**
5182     ------------------------------------------------------------------------------------
5183     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
5184     store atomic unordered    *any*          *any*      *Same as non-atomic*.
5185     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
5186     **Monotonic Atomic**
5187     ------------------------------------------------------------------------------------
5188     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
5189                               - wavefront    - local
5190                               - workgroup    - generic
5191     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
5192                               - system       - generic     glc=1
5193     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
5194                               - wavefront    - generic
5195                               - workgroup
5196                               - agent
5197                               - system
5198     store atomic monotonic    - singlethread - local    1. ds_store
5199                               - wavefront
5200                               - workgroup
5201     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
5202                               - wavefront    - generic
5203                               - workgroup
5204                               - agent
5205                               - system
5206     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
5207                               - wavefront
5208                               - workgroup
5209     **Acquire Atomic**
5210     ------------------------------------------------------------------------------------
5211     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
5212                               - wavefront    - local
5213                                              - generic
5214     load atomic  acquire      - workgroup    - global   1. buffer/global_load
5215     load atomic  acquire      - workgroup    - local    1. ds/flat_load
5216                                              - generic  2. s_waitcnt lgkmcnt(0)
5217
5218                                                           - If OpenCL, omit.
5219                                                           - Must happen before
5220                                                             any following
5221                                                             global/generic
5222                                                             load/load
5223                                                             atomic/store/store
5224                                                             atomic/atomicrmw.
5225                                                           - Ensures any
5226                                                             following global
5227                                                             data read is no
5228                                                             older than a local load
5229                                                             atomic value being
5230                                                             acquired.
5231
5232     load atomic  acquire      - agent        - global   1. buffer/global_load
5233                               - system                     glc=1
5234                                                         2. s_waitcnt vmcnt(0)
5235
5236                                                           - Must happen before
5237                                                             following
5238                                                             buffer_wbinvl1_vol.
5239                                                           - Ensures the load
5240                                                             has completed
5241                                                             before invalidating
5242                                                             the cache.
5243
5244                                                         3. buffer_wbinvl1_vol
5245
5246                                                           - Must happen before
5247                                                             any following
5248                                                             global/generic
5249                                                             load/load
5250                                                             atomic/atomicrmw.
5251                                                           - Ensures that
5252                                                             following
5253                                                             loads will not see
5254                                                             stale global data.
5255
5256     load atomic  acquire      - agent        - generic  1. flat_load glc=1
5257                               - system                  2. s_waitcnt vmcnt(0) &
5258                                                            lgkmcnt(0)
5259
5260                                                           - If OpenCL omit
5261                                                             lgkmcnt(0).
5262                                                           - Must happen before
5263                                                             following
5264                                                             buffer_wbinvl1_vol.
5265                                                           - Ensures the flat_load
5266                                                             has completed
5267                                                             before invalidating
5268                                                             the cache.
5269
5270                                                         3. buffer_wbinvl1_vol
5271
5272                                                           - Must happen before
5273                                                             any following
5274                                                             global/generic
5275                                                             load/load
5276                                                             atomic/atomicrmw.
5277                                                           - Ensures that
5278                                                             following loads
5279                                                             will not see stale
5280                                                             global data.
5281
5282     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5283                               - wavefront    - local
5284                                              - generic
5285     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5286     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5287                                              - generic  2. s_waitcnt lgkmcnt(0)
5288
5289                                                           - If OpenCL, omit.
5290                                                           - Must happen before
5291                                                             any following
5292                                                             global/generic
5293                                                             load/load
5294                                                             atomic/store/store
5295                                                             atomic/atomicrmw.
5296                                                           - Ensures any
5297                                                             following global
5298                                                             data read is no
5299                                                             older than a local
5300                                                             atomicrmw value
5301                                                             being acquired.
5302
5303     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5304                               - system                  2. s_waitcnt vmcnt(0)
5305
5306                                                           - Must happen before
5307                                                             following
5308                                                             buffer_wbinvl1_vol.
5309                                                           - Ensures the
5310                                                             atomicrmw has
5311                                                             completed before
5312                                                             invalidating the
5313                                                             cache.
5314
5315                                                         3. buffer_wbinvl1_vol
5316
5317                                                           - Must happen before
5318                                                             any following
5319                                                             global/generic
5320                                                             load/load
5321                                                             atomic/atomicrmw.
5322                                                           - Ensures that
5323                                                             following loads
5324                                                             will not see stale
5325                                                             global data.
5326
5327     atomicrmw    acquire      - agent        - generic  1. flat_atomic
5328                               - system                  2. s_waitcnt vmcnt(0) &
5329                                                            lgkmcnt(0)
5330
5331                                                           - If OpenCL, omit
5332                                                             lgkmcnt(0).
5333                                                           - Must happen before
5334                                                             following
5335                                                             buffer_wbinvl1_vol.
5336                                                           - Ensures the
5337                                                             atomicrmw has
5338                                                             completed before
5339                                                             invalidating the
5340                                                             cache.
5341
5342                                                         3. buffer_wbinvl1_vol
5343
5344                                                           - Must happen before
5345                                                             any following
5346                                                             global/generic
5347                                                             load/load
5348                                                             atomic/atomicrmw.
5349                                                           - Ensures that
5350                                                             following loads
5351                                                             will not see stale
5352                                                             global data.
5353
5354     fence        acquire      - singlethread *none*     *none*
5355                               - wavefront
5356     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5357
5358                                                           - If OpenCL and
5359                                                             address space is
5360                                                             not generic, omit.
5361                                                           - However, since LLVM
5362                                                             currently has no
5363                                                             address space on
5364                                                             the fence, it must
5365                                                             conservatively
5366                                                             always be generated. If
5367                                                             fence had an
5368                                                             address space then
5369                                                             set to address
5370                                                             space of OpenCL
5371                                                             fence flag, or to
5372                                                             generic if both
5373                                                             local and global
5374                                                             flags are
5375                                                             specified.
5376                                                           - Must happen after
5377                                                             any preceding
5378                                                             local/generic load
5379                                                             atomic/atomicrmw
5380                                                             with an equal or
5381                                                             wider sync scope
5382                                                             and memory ordering
5383                                                             stronger than
5384                                                             unordered (this is
5385                                                             termed the
5386                                                             fence-paired-atomic).
5387                                                           - Must happen before
5388                                                             any following
5389                                                             global/generic
5390                                                             load/load
5391                                                             atomic/store/store
5392                                                             atomic/atomicrmw.
5393                                                           - Ensures any
5394                                                             following global
5395                                                             data read is no
5396                                                             older than the
5397                                                             value read by the
5398                                                             fence-paired-atomic.
5399
5400     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5401                               - system                     vmcnt(0)
5402
5403                                                           - If OpenCL and
5404                                                             address space is
5405                                                             not generic, omit
5406                                                             lgkmcnt(0).
5407                                                           - However, since LLVM
5408                                                             currently has no
5409                                                             address space on
5410                                                             the fence, it must
5411                                                             conservatively
5412                                                             always be generated
5413                                                             (see comment for
5414                                                             previous fence).
5415                                                           - Could be split into
5416                                                             separate s_waitcnt
5417                                                             vmcnt(0) and
5418                                                             s_waitcnt
5419                                                             lgkmcnt(0) to allow
5420                                                             them to be
5421                                                             independently moved
5422                                                             according to the
5423                                                             following rules.
5424                                                           - s_waitcnt vmcnt(0)
5425                                                             must happen after
5426                                                             any preceding
5427                                                             global/generic load
5428                                                             atomic/atomicrmw
5429                                                             with an equal or
5430                                                             wider sync scope
5431                                                             and memory ordering
5432                                                             stronger than
5433                                                             unordered (this is
5434                                                             termed the
5435                                                             fence-paired-atomic).
5436                                                           - s_waitcnt lgkmcnt(0)
5437                                                             must happen after
5438                                                             any preceding
5439                                                             local/generic load
5440                                                             atomic/atomicrmw
5441                                                             with an equal or
5442                                                             wider sync scope
5443                                                             and memory ordering
5444                                                             stronger than
5445                                                             unordered (this is
5446                                                             termed the
5447                                                             fence-paired-atomic).
5448                                                           - Must happen before
5449                                                             the following
5450                                                             buffer_wbinvl1_vol.
5451                                                           - Ensures that the
5452                                                             fence-paired-atomic
5453                                                             has completed
5454                                                             before invalidating
5455                                                             the
5456                                                             cache. Therefore
5457                                                             any following
5458                                                             locations read must
5459                                                             be no older than
5460                                                             the value read by
5461                                                             the
5462                                                             fence-paired-atomic.
5463
5464                                                         2. buffer_wbinvl1_vol
5465
5466                                                           - Must happen before any
5467                                                             following global/generic
5468                                                             load/load
5469                                                             atomic/store/store
5470                                                             atomic/atomicrmw.
5471                                                           - Ensures that
5472                                                             following loads
5473                                                             will not see stale
5474                                                             global data.
5475
5476     **Release Atomic**
5477     ------------------------------------------------------------------------------------
5478     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5479                               - wavefront    - local
5480                                              - generic
5481     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5482                                              - generic
5483                                                           - If OpenCL, omit.
5484                                                           - Must happen after
5485                                                             any preceding
5486                                                             local/generic
5487                                                             load/store/load
5488                                                             atomic/store
5489                                                             atomic/atomicrmw.
5490                                                           - Must happen before
5491                                                             the following
5492                                                             store.
5493                                                           - Ensures that all
5494                                                             memory operations
5495                                                             to local have
5496                                                             completed before
5497                                                             performing the
5498                                                             store that is being
5499                                                             released.
5500
5501                                                         2. buffer/global/flat_store
5502     store atomic release      - workgroup    - local    1. ds_store
5503     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5504                               - system       - generic     vmcnt(0)
5505
5506                                                           - If OpenCL and
5507                                                             address space is
5508                                                             not generic, omit
5509                                                             lgkmcnt(0).
5510                                                           - Could be split into
5511                                                             separate s_waitcnt
5512                                                             vmcnt(0) and
5513                                                             s_waitcnt
5514                                                             lgkmcnt(0) to allow
5515                                                             them to be
5516                                                             independently moved
5517                                                             according to the
5518                                                             following rules.
5519                                                           - s_waitcnt vmcnt(0)
5520                                                             must happen after
5521                                                             any preceding
5522                                                             global/generic
5523                                                             load/store/load
5524                                                             atomic/store
5525                                                             atomic/atomicrmw.
5526                                                           - s_waitcnt lgkmcnt(0)
5527                                                             must happen after
5528                                                             any preceding
5529                                                             local/generic
5530                                                             load/store/load
5531                                                             atomic/store
5532                                                             atomic/atomicrmw.
5533                                                           - Must happen before
5534                                                             the following
5535                                                             store.
5536                                                           - Ensures that all
5537                                                             memory operations
5538                                                             have
5539                                                             completed before
5540                                                             performing the
5541                                                             store that is being
5542                                                             released.
5543
5544                                                         2. buffer/global/flat_store
5545     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5546                               - wavefront    - local
5547                                              - generic
5548     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5549                                              - generic
5550                                                           - If OpenCL, omit.
5551                                                           - Must happen after
5552                                                             any preceding
5553                                                             local/generic
5554                                                             load/store/load
5555                                                             atomic/store
5556                                                             atomic/atomicrmw.
5557                                                           - Must happen before
5558                                                             the following
5559                                                             atomicrmw.
5560                                                           - Ensures that all
5561                                                             memory operations
5562                                                             to local have
5563                                                             completed before
5564                                                             performing the
5565                                                             atomicrmw that is
5566                                                             being released.
5567
5568                                                         2. buffer/global/flat_atomic
5569     atomicrmw    release      - workgroup    - local    1. ds_atomic
5570     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5571                               - system       - generic     vmcnt(0)
5572
5573                                                           - If OpenCL, omit
5574                                                             lgkmcnt(0).
5575                                                           - Could be split into
5576                                                             separate s_waitcnt
5577                                                             vmcnt(0) and
5578                                                             s_waitcnt
5579                                                             lgkmcnt(0) to allow
5580                                                             them to be
5581                                                             independently moved
5582                                                             according to the
5583                                                             following rules.
5584                                                           - s_waitcnt vmcnt(0)
5585                                                             must happen after
5586                                                             any preceding
5587                                                             global/generic
5588                                                             load/store/load
5589                                                             atomic/store
5590                                                             atomic/atomicrmw.
5591                                                           - s_waitcnt lgkmcnt(0)
5592                                                             must happen after
5593                                                             any preceding
5594                                                             local/generic
5595                                                             load/store/load
5596                                                             atomic/store
5597                                                             atomic/atomicrmw.
5598                                                           - Must happen before
5599                                                             the following
5600                                                             atomicrmw.
5601                                                           - Ensures that all
5602                                                             memory operations
5603                                                             to global and local
5604                                                             have completed
5605                                                             before performing
5606                                                             the atomicrmw that
5607                                                             is being released.
5608
5609                                                         2. buffer/global/flat_atomic
5610     fence        release      - singlethread *none*     *none*
5611                               - wavefront
5612     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5613
5614                                                           - If OpenCL and
5615                                                             address space is
5616                                                             not generic, omit.
5617                                                           - However, since LLVM
5618                                                             currently has no
5619                                                             address space on
5620                                                             the fence, it must
5621                                                             conservatively
5622                                                             always be generated. If
5623                                                             fence had an
5624                                                             address space then
5625                                                             set to address
5626                                                             space of OpenCL
5627                                                             fence flag, or to
5628                                                             generic if both
5629                                                             local and global
5630                                                             flags are
5631                                                             specified.
5632                                                           - Must happen after
5633                                                             any preceding
5634                                                             local/generic
5635                                                             load/load
5636                                                             atomic/store/store
5637                                                             atomic/atomicrmw.
5638                                                           - Must happen before
5639                                                             any following store
5640                                                             atomic/atomicrmw
5641                                                             with an equal or
5642                                                             wider sync scope
5643                                                             and memory ordering
5644                                                             stronger than
5645                                                             unordered (this is
5646                                                             termed the
5647                                                             fence-paired-atomic).
5648                                                           - Ensures that all
5649                                                             memory operations
5650                                                             to local have
5651                                                             completed before
5652                                                             performing the
5653                                                             following
5654                                                             fence-paired-atomic.
5655
5656     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5657                               - system                     vmcnt(0)
5658
5659                                                           - If OpenCL and
5660                                                             address space is
5661                                                             not generic, omit
5662                                                             lgkmcnt(0).
5663                                                           - If OpenCL and
5664                                                             address space is
5665                                                             local, omit
5666                                                             vmcnt(0).
5667                                                           - However, since LLVM
5668                                                             currently has no
5669                                                             address space on
5670                                                             the fence, it must
5671                                                             conservatively
5672                                                             always be generated. If
5673                                                             fence had an
5674                                                             address space then
5675                                                             set to address
5676                                                             space of OpenCL
5677                                                             fence flag, or to
5678                                                             generic if both
5679                                                             local and global
5680                                                             flags are
5681                                                             specified.
5682                                                           - Could be split into
5683                                                             separate s_waitcnt
5684                                                             vmcnt(0) and
5685                                                             s_waitcnt
5686                                                             lgkmcnt(0) to allow
5687                                                             them to be
5688                                                             independently moved
5689                                                             according to the
5690                                                             following rules.
5691                                                           - s_waitcnt vmcnt(0)
5692                                                             must happen after
5693                                                             any preceding
5694                                                             global/generic
5695                                                             load/store/load
5696                                                             atomic/store
5697                                                             atomic/atomicrmw.
5698                                                           - s_waitcnt lgkmcnt(0)
5699                                                             must happen after
5700                                                             any preceding
5701                                                             local/generic
5702                                                             load/store/load
5703                                                             atomic/store
5704                                                             atomic/atomicrmw.
5705                                                           - Must happen before
5706                                                             any following store
5707                                                             atomic/atomicrmw
5708                                                             with an equal or
5709                                                             wider sync scope
5710                                                             and memory ordering
5711                                                             stronger than
5712                                                             unordered (this is
5713                                                             termed the
5714                                                             fence-paired-atomic).
5715                                                           - Ensures that all
5716                                                             memory operations
5717                                                             have
5718                                                             completed before
5719                                                             performing the
5720                                                             following
5721                                                             fence-paired-atomic.
5722
5723     **Acquire-Release Atomic**
5724     ------------------------------------------------------------------------------------
5725     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5726                               - wavefront    - local
5727                                              - generic
5728     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5729
5730                                                           - If OpenCL, omit.
5731                                                           - Must happen after
5732                                                             any preceding
5733                                                             local/generic
5734                                                             load/store/load
5735                                                             atomic/store
5736                                                             atomic/atomicrmw.
5737                                                           - Must happen before
5738                                                             the following
5739                                                             atomicrmw.
5740                                                           - Ensures that all
5741                                                             memory operations
5742                                                             to local have
5743                                                             completed before
5744                                                             performing the
5745                                                             atomicrmw that is
5746                                                             being released.
5747
5748                                                         2. buffer/global_atomic
5749
5750     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5751                                                         2. s_waitcnt lgkmcnt(0)
5752
5753                                                           - If OpenCL, omit.
5754                                                           - Must happen before
5755                                                             any following
5756                                                             global/generic
5757                                                             load/load
5758                                                             atomic/store/store
5759                                                             atomic/atomicrmw.
5760                                                           - Ensures any
5761                                                             following global
5762                                                             data read is no
5763                                                             older than the local
5764                                                             atomicrmw value being
5765                                                             acquired.
5766
5767     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5768
5769                                                           - If OpenCL, omit.
5770                                                           - Must happen after
5771                                                             any preceding
5772                                                             local/generic
5773                                                             load/store/load
5774                                                             atomic/store
5775                                                             atomic/atomicrmw.
5776                                                           - Must happen before
5777                                                             the following
5778                                                             atomicrmw.
5779                                                           - Ensures that all
5780                                                             memory operations
5781                                                             to local have
5782                                                             completed before
5783                                                             performing the
5784                                                             atomicrmw that is
5785                                                             being released.
5786
5787                                                         2. flat_atomic
5788                                                         3. s_waitcnt lgkmcnt(0)
5789
5790                                                           - If OpenCL, omit.
5791                                                           - Must happen before
5792                                                             any following
5793                                                             global/generic
5794                                                             load/load
5795                                                             atomic/store/store
5796                                                             atomic/atomicrmw.
5797                                                           - Ensures any
5798                                                             following global
5799                                                             data read is no
5800                                                             older than a local
5801                                                             atomicrmw value being
5802                                                             acquired.
5803
5804     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5805                               - system                     vmcnt(0)
5806
5807                                                           - If OpenCL, omit
5808                                                             lgkmcnt(0).
5809                                                           - Could be split into
5810                                                             separate s_waitcnt
5811                                                             vmcnt(0) and
5812                                                             s_waitcnt
5813                                                             lgkmcnt(0) to allow
5814                                                             them to be
5815                                                             independently moved
5816                                                             according to the
5817                                                             following rules.
5818                                                           - s_waitcnt vmcnt(0)
5819                                                             must happen after
5820                                                             any preceding
5821                                                             global/generic
5822                                                             load/store/load
5823                                                             atomic/store
5824                                                             atomic/atomicrmw.
5825                                                           - s_waitcnt lgkmcnt(0)
5826                                                             must happen after
5827                                                             any preceding
5828                                                             local/generic
5829                                                             load/store/load
5830                                                             atomic/store
5831                                                             atomic/atomicrmw.
5832                                                           - Must happen before
5833                                                             the following
5834                                                             atomicrmw.
5835                                                           - Ensures that all
5836                                                             memory operations
5837                                                             to global have
5838                                                             completed before
5839                                                             performing the
5840                                                             atomicrmw that is
5841                                                             being released.
5842
5843                                                         2. buffer/global_atomic
5844                                                         3. s_waitcnt vmcnt(0)
5845
5846                                                           - Must happen before
5847                                                             following
5848                                                             buffer_wbinvl1_vol.
5849                                                           - Ensures the
5850                                                             atomicrmw has
5851                                                             completed before
5852                                                             invalidating the
5853                                                             cache.
5854
5855                                                         4. buffer_wbinvl1_vol
5856
5857                                                           - Must happen before
5858                                                             any following
5859                                                             global/generic
5860                                                             load/load
5861                                                             atomic/atomicrmw.
5862                                                           - Ensures that
5863                                                             following loads
5864                                                             will not see stale
5865                                                             global data.
5866
5867     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
5868                               - system                     vmcnt(0)
5869
5870                                                           - If OpenCL, omit
5871                                                             lgkmcnt(0).
5872                                                           - Could be split into
5873                                                             separate s_waitcnt
5874                                                             vmcnt(0) and
5875                                                             s_waitcnt
5876                                                             lgkmcnt(0) to allow
5877                                                             them to be
5878                                                             independently moved
5879                                                             according to the
5880                                                             following rules.
5881                                                           - s_waitcnt vmcnt(0)
5882                                                             must happen after
5883                                                             any preceding
5884                                                             global/generic
5885                                                             load/store/load
5886                                                             atomic/store
5887                                                             atomic/atomicrmw.
5888                                                           - s_waitcnt lgkmcnt(0)
5889                                                             must happen after
5890                                                             any preceding
5891                                                             local/generic
5892                                                             load/store/load
5893                                                             atomic/store
5894                                                             atomic/atomicrmw.
5895                                                           - Must happen before
5896                                                             the following
5897                                                             atomicrmw.
5898                                                           - Ensures that all
5899                                                             memory operations
5900                                                             to global have
5901                                                             completed before
5902                                                             performing the
5903                                                             atomicrmw that is
5904                                                             being released.
5905
5906                                                         2. flat_atomic
5907                                                         3. s_waitcnt vmcnt(0) &
5908                                                            lgkmcnt(0)
5909
5910                                                           - If OpenCL, omit
5911                                                             lgkmcnt(0).
5912                                                           - Must happen before
5913                                                             following
5914                                                             buffer_wbinvl1_vol.
5915                                                           - Ensures the
5916                                                             atomicrmw has
5917                                                             completed before
5918                                                             invalidating the
5919                                                             cache.
5920
5921                                                         4. buffer_wbinvl1_vol
5922
5923                                                           - Must happen before
5924                                                             any following
5925                                                             global/generic
5926                                                             load/load
5927                                                             atomic/atomicrmw.
5928                                                           - Ensures that
5929                                                             following loads
5930                                                             will not see stale
5931                                                             global data.
5932
5933     fence        acq_rel      - singlethread *none*     *none*
5934                               - wavefront
5935     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5936
5937                                                           - If OpenCL and
5938                                                             address space is
5939                                                             not generic, omit.
5940                                                           - However,
5941                                                             since LLVM
5942                                                             currently has no
5943                                                             address space on
5944                                                             the fence, the s_waitcnt
5945                                                             must conservatively
5946                                                             always be generated
5947                                                             (see comment for
5948                                                             previous fence).
5949                                                           - Must happen after
5950                                                             any preceding
5951                                                             local/generic
5952                                                             load/load
5953                                                             atomic/store/store
5954                                                             atomic/atomicrmw.
5955                                                           - Must happen before
5956                                                             any following
5957                                                             global/generic
5958                                                             load/load
5959                                                             atomic/store/store
5960                                                             atomic/atomicrmw.
5961                                                           - Ensures that all
5962                                                             memory operations
5963                                                             to local have
5964                                                             completed before
5965                                                             performing any
5966                                                             following global
5967                                                             memory operations.
5968                                                           - Ensures that the
5969                                                             preceding
5970                                                             local/generic load
5971                                                             atomic/atomicrmw
5972                                                             with an equal or
5973                                                             wider sync scope
5974                                                             and memory ordering
5975                                                             stronger than
5976                                                             unordered (this is
5977                                                             termed the
5978                                                             acquire-fence-paired-atomic)
5979                                                             has completed
5980                                                             before following
5981                                                             global memory
5982                                                             operations. This
5983                                                             satisfies the
5984                                                             requirements of
5985                                                             acquire.
5986                                                           - Ensures that all
5987                                                             previous memory
5988                                                             operations have
5989                                                             completed before a
5990                                                             following
5991                                                             local/generic store
5992                                                             atomic/atomicrmw
5993                                                             with an equal or
5994                                                             wider sync scope
5995                                                             and memory ordering
5996                                                             stronger than
5997                                                             unordered (this is
5998                                                             termed the
5999                                                             release-fence-paired-atomic).
6000                                                             This satisfies the
6001                                                             requirements of
6002                                                             release.
6003
6004     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6005                               - system                     vmcnt(0)
6006
6007                                                           - If OpenCL and
6008                                                             address space is
6009                                                             not generic, omit
6010                                                             lgkmcnt(0).
6011                                                           - However, since LLVM
6012                                                             currently has no
6013                                                             address space on
6014                                                             the fence, lgkmcnt(0)
6015                                                             must conservatively
6016                                                             always be generated
6017                                                             (see comment for
6018                                                             previous fence).
6019                                                           - Could be split into
6020                                                             separate s_waitcnt
6021                                                             vmcnt(0) and
6022                                                             s_waitcnt
6023                                                             lgkmcnt(0) to allow
6024                                                             them to be
6025                                                             independently moved
6026                                                             according to the
6027                                                             following rules.
6028                                                           - s_waitcnt vmcnt(0)
6029                                                             must happen after
6030                                                             any preceding
6031                                                             global/generic
6032                                                             load/store/load
6033                                                             atomic/store
6034                                                             atomic/atomicrmw.
6035                                                           - s_waitcnt lgkmcnt(0)
6036                                                             must happen after
6037                                                             any preceding
6038                                                             local/generic
6039                                                             load/store/load
6040                                                             atomic/store
6041                                                             atomic/atomicrmw.
6042                                                           - Must happen before
6043                                                             the following
6044                                                             buffer_wbinvl1_vol.
6045                                                           - Ensures that the
6046                                                             preceding
6047                                                             global/local/generic
6048                                                             load
6049                                                             atomic/atomicrmw
6050                                                             with an equal or
6051                                                             wider sync scope
6052                                                             and memory ordering
6053                                                             stronger than
6054                                                             unordered (this is
6055                                                             termed the
6056                                                             acquire-fence-paired-atomic)
6057                                                             has completed
6058                                                             before invalidating
6059                                                             the cache. This
6060                                                             satisfies the
6061                                                             requirements of
6062                                                             acquire.
6063                                                           - Ensures that all
6064                                                             previous memory
6065                                                             operations have
6066                                                             completed before a
6067                                                             following
6068                                                             global/local/generic
6069                                                             store
6070                                                             atomic/atomicrmw
6071                                                             with an equal or
6072                                                             wider sync scope
6073                                                             and memory ordering
6074                                                             stronger than
6075                                                             unordered (this is
6076                                                             termed the
6077                                                             release-fence-paired-atomic).
6078                                                             This satisfies the
6079                                                             requirements of
6080                                                             release.
6081
6082                                                         2. buffer_wbinvl1_vol
6083
6084                                                           - Must happen before
6085                                                             any following
6086                                                             global/generic
6087                                                             load/load
6088                                                             atomic/store/store
6089                                                             atomic/atomicrmw.
6090                                                           - Ensures that
6091                                                             following loads
6092                                                             will not see stale
6093                                                             global data. This
6094                                                             satisfies the
6095                                                             requirements of
6096                                                             acquire.
6097
6098     **Sequentially Consistent Atomic**
6099     ------------------------------------------------------------------------------------
6100     load atomic  seq_cst      - singlethread - global   *Same as corresponding
6101                               - wavefront    - local    load atomic acquire,
6102                                              - generic  except must generate
6103                                                         all instructions even
6104                                                         for OpenCL.*
6105     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6106                                              - generic
6107
6108                                                           - Must
6109                                                             happen after
6110                                                             preceding
6111                                                             local/generic load
6112                                                             atomic/store
6113                                                             atomic/atomicrmw
6114                                                             with memory
6115                                                             ordering of seq_cst
6116                                                             and with equal or
6117                                                             wider sync scope.
6118                                                             (Note that seq_cst
6119                                                             fences have their
6120                                                             own s_waitcnt
6121                                                             lgkmcnt(0) and so do
6122                                                             not need to be
6123                                                             considered.)
6124                                                           - Ensures any
6125                                                             preceding
6126                                                             sequentially
6127                                                             consistent local
6128                                                             memory instructions
6129                                                             have completed
6130                                                             before executing
6131                                                             this sequentially
6132                                                             consistent
6133                                                             instruction. This
6134                                                             prevents reordering
6135                                                             a seq_cst store
6136                                                             followed by a
6137                                                             seq_cst load. (Note
6138                                                             that seq_cst is
6139                                                             stronger than
6140                                                             acquire/release as
6141                                                             the reordering of
6142                                                             load acquire
6143                                                             followed by a store
6144                                                             release is
6145                                                             prevented by the
6146                                                             s_waitcnt of
6147                                                             the release, but
6148                                                             there is nothing
6149                                                             preventing a store
6150                                                             release followed by
6151                                                             load acquire from
6152                                                             completing out of
6153                                                             order. The s_waitcnt
6154                                                             could be placed after
6155                                                             seq_store or before
6156                                                             the seq_load. We
6157                                                             choose the load to
6158                                                             make the s_waitcnt be
6159                                                             as late as possible
6160                                                             so that the store
6161                                                             may have already
6162                                                             completed.)
6163
6164                                                         2. *Following
6165                                                            instructions same as
6166                                                            corresponding load
6167                                                            atomic acquire,
6168                                                            except must generate
6169                                                            all instructions even
6170                                                            for OpenCL.*
6171     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6172                                                         load atomic acquire,
6173                                                         except must generate
6174                                                         all instructions even
6175                                                         for OpenCL.*
6176
6177     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6178                               - system       - generic     vmcnt(0)
6179
6180                                                           - Could be split into
6181                                                             separate s_waitcnt
6182                                                             vmcnt(0)
6183                                                             and s_waitcnt
6184                                                             lgkmcnt(0) to allow
6185                                                             them to be
6186                                                             independently moved
6187                                                             according to the
6188                                                             following rules.
6189                                                           - s_waitcnt lgkmcnt(0)
6190                                                             must happen after
6191                                                             preceding
6192                                                             global/generic load
6193                                                             atomic/store
6194                                                             atomic/atomicrmw
6195                                                             with memory
6196                                                             ordering of seq_cst
6197                                                             and with equal or
6198                                                             wider sync scope.
6199                                                             (Note that seq_cst
6200                                                             fences have their
6201                                                             own s_waitcnt
6202                                                             lgkmcnt(0) and so do
6203                                                             not need to be
6204                                                             considered.)
6205                                                           - s_waitcnt vmcnt(0)
6206                                                             must happen after
6207                                                             preceding
6208                                                             global/generic load
6209                                                             atomic/store
6210                                                             atomic/atomicrmw
6211                                                             with memory
6212                                                             ordering of seq_cst
6213                                                             and with equal or
6214                                                             wider sync scope.
6215                                                             (Note that seq_cst
6216                                                             fences have their
6217                                                             own s_waitcnt
6218                                                             vmcnt(0) and so do
6219                                                             not need to be
6220                                                             considered.)
6221                                                           - Ensures any
6222                                                             preceding
                                                             sequentially
6224                                                             consistent global
6225                                                             memory instructions
6226                                                             have completed
6227                                                             before executing
6228                                                             this sequentially
6229                                                             consistent
6230                                                             instruction. This
6231                                                             prevents reordering
6232                                                             a seq_cst store
6233                                                             followed by a
6234                                                             seq_cst load. (Note
6235                                                             that seq_cst is
6236                                                             stronger than
6237                                                             acquire/release as
6238                                                             the reordering of
6239                                                             load acquire
6240                                                             followed by a store
6241                                                             release is
6242                                                             prevented by the
6243                                                             s_waitcnt of
6244                                                             the release, but
6245                                                             there is nothing
6246                                                             preventing a store
6247                                                             release followed by
6248                                                             load acquire from
6249                                                             completing out of
6250                                                             order. The s_waitcnt
6251                                                             could be placed after
6252                                                             seq_store or before
6253                                                             the seq_load. We
6254                                                             choose the load to
6255                                                             make the s_waitcnt be
6256                                                             as late as possible
6257                                                             so that the store
6258                                                             may have already
6259                                                             completed.)
6260
6261                                                         2. *Following
6262                                                            instructions same as
6263                                                            corresponding load
6264                                                            atomic acquire,
6265                                                            except must generate
6266                                                            all instructions even
6267                                                            for OpenCL.*
6268     store atomic seq_cst      - singlethread - global   *Same as corresponding
6269                               - wavefront    - local    store atomic release,
6270                               - workgroup    - generic  except must generate
6271                               - agent                   all instructions even
6272                               - system                  for OpenCL.*
6273     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6274                               - wavefront    - local    atomicrmw acq_rel,
6275                               - workgroup    - generic  except must generate
6276                               - agent                   all instructions even
6277                               - system                  for OpenCL.*
6278     fence        seq_cst      - singlethread *none*     *Same as corresponding
6279                               - wavefront               fence acq_rel,
6280                               - workgroup               except must generate
6281                               - agent                   all instructions even
6282                               - system                  for OpenCL.*
6283     ============ ============ ============== ========== ================================
6284
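
The store-then-load reordering that the ``s_waitcnt`` before a seq_cst load
prevents can be illustrated with the classic store-buffering litmus test. The
following Python sketch is not AMDGPU code. It is a toy model, with
hypothetical names, that enumerates every interleaving of two threads that each
store 1 to one location and then load the other. If a thread's store may
complete after its own following load (as can happen for a store release
followed by a load acquire), the outcome in which both loads read 0 becomes
possible. If program order is kept (the effect of the wait before the seq_cst
load), it does not.

```python
# Toy model with hypothetical names; it does not model real hardware.
# Each thread stores 1 to one location, then loads the other location.
T0 = [("store", "x"), ("load", "y", "r0")]
T1 = [("store", "y"), ("load", "x", "r1")]


def thread_orders(ops, allow_store_load_reorder):
    """Orders in which one thread's two operations may complete."""
    orders = [list(ops)]
    if allow_store_load_reorder:
        # Without a wait between them, the load may complete first.
        orders.append(list(reversed(ops)))
    return orders


def interleavings(a, b):
    """Yield all merges of a and b that keep each sequence's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(allow_store_load_reorder):
    """Return the set of (r0, r1) values over all allowed completion orders."""
    results = set()
    for o0 in thread_orders(T0, allow_store_load_reorder):
        for o1 in thread_orders(T1, allow_store_load_reorder):
            for trace in interleavings(o0, o1):
                mem = {"x": 0, "y": 0}
                regs = {}
                for op in trace:
                    if op[0] == "store":
                        mem[op[1]] = 1
                    else:
                        regs[op[2]] = mem[op[1]]
                results.add((regs["r0"], regs["r1"]))
    return results
```

In this model ``outcomes(False)`` never contains ``(0, 0)``, while
``outcomes(True)`` does, which is the behavior seq_cst must rule out.
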
6285.. _amdgpu-amdhsa-memory-model-gfx90a:
6286
6287Memory Model GFX90A
6288+++++++++++++++++++
6289
6290For GFX90A:
6291
6292* Each agent has multiple shader arrays (SA).
6293* Each SA has multiple compute units (CU).
6294* Each CU has multiple SIMDs that execute wavefronts.
6295* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in
  which the wavefronts may be executed by different SIMDs in different CUs.
6298* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated because wavefronts of the same work-group can be in different CUs.
6301* All LDS operations of a CU are performed as wavefront wide operations in a
6302  global order and involve no caching. Completion is reported to a wavefront in
6303  execution order.
6304* The LDS memory has multiple request queues shared by the SIMDs of a
6305  CU. Therefore, the LDS operations performed by different wavefronts of a
6306  work-group can be reordered relative to each other, which can result in
6307  reordering the visibility of vector memory operations with respect to LDS
6308  operations of other wavefronts in the same work-group. A ``s_waitcnt
6309  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6310  vector memory operations between wavefronts of a work-group, but not between
6311  operations performed by the same wavefront.
6312* The vector memory operations are performed as wavefront wide operations and
6313  completion is reported to a wavefront in execution order. The exception is
6314  that ``flat_load/store/atomic`` instructions can report out of vector memory
6315  order if they access LDS memory, and out of LDS operation order if they access
6316  global memory.
6317* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:
6319
6320  * No special action is required for coherence between the lanes of a single
6321    wavefront.
6322
6323  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is tgsplit
    execution mode, in which wavefronts of the same work-group can be in
    different CUs, so a ``buffer_wbinvl1_vol`` is required as described in
    the following item.
6328
6329  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6330    executing in different work-groups as they may be executing on different
6331    CUs.
6332
6333* The scalar memory operations access a scalar L1 cache shared by all wavefronts
6334  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6335  scalar operations are used in a restricted way so do not impact the memory
6336  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6337* The vector and scalar memory operations use an L2 cache shared by all CUs on
6338  the same agent.
6339
6340  * The L2 cache has independent channels to service disjoint ranges of virtual
6341    addresses.
6342  * Each CU has a separate request queue per channel. Therefore, the vector and
6343    scalar memory operations performed by wavefronts executing in different
6344    work-groups (which may be executing on different CUs), or the same
6345    work-group if executing in tgsplit mode, of an agent can be reordered
6346    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6347    synchronization between vector memory operations of different CUs. It
6348    ensures a previous vector memory operation has completed before executing a
6349    subsequent vector memory or LDS operation and so can be used to meet the
6350    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by using
    the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE C-bit
    for memory local to the L2, and by using the MTYPE NC (non-coherent) with
    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6355
6356    * Any local memory cache lines will be automatically invalidated by writes
6357      from CUs associated with other L2 caches, or writes from the CPU, due to
6358      the cache probe caused by coherent requests. Coherent requests are caused
6359      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6360      XGMI, and by PCIe requests that are configured to be coherent requests.
6361    * XGMI accesses from the CPU to local memory may be cached on the CPU.
6362      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6364    * Since all work-groups on the same agent share the same L2, no L2
6365      invalidation or writeback is required for coherence.
6366    * To ensure coherence of local and remote memory writes of work-groups in
6367      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6368      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
      (used for remote coarse grain memory). Note that MTYPE CC (used for local
6370      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6371      remote fine grain memory) bypasses the L2, so both will never result in
6372      dirty L2 cache lines.
6373    * To ensure coherence of local and remote memory reads of work-groups in
6374      different agents a ``buffer_invl2`` is required. It will invalidate L2
6375      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6376      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
      coarse grain memory) cause local reads to be invalidated by remote writes
      with the PTE C-bit so these cache lines are not invalidated. Note that
6379      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6380      never result in L2 cache lines that need to be invalidated.
6381
6382  * PCIe access from the GPU to the CPU memory is kept coherent by using the
6383    MTYPE UC (uncached) which bypasses the L2.
6384
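
The L2 behavior of the four MTYPEs described in the list above can be
summarized as a small lookup table. The following Python sketch only restates
that list. The class and function names are hypothetical and are not part of
LLVM or of any AMD tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MtypeL2Behavior:
    """L2 behavior of one MTYPE, restating the list above."""
    used_for: str
    can_leave_dirty_lines: bool  # dirty lines are written back by buffer_wbl2
    invalidated_by_invl2: bool   # lines are invalidated by buffer_invl2


# Hypothetical table for illustration only.
MTYPE_L2 = {
    # RW: kept coherent for reads by remote writes with the PTE C-bit.
    "RW": MtypeL2Behavior("local coarse grain memory", True, False),
    # NC: needs both explicit writeback and explicit invalidation.
    "NC": MtypeL2Behavior("remote coarse grain memory", True, True),
    # CC: writes through to DRAM, so never leaves dirty lines.
    "CC": MtypeL2Behavior("local fine grain memory", False, False),
    # UC: bypasses the L2 entirely.
    "UC": MtypeL2Behavior("remote fine grain memory", False, False),
}


def mtypes_written_back_by_wbl2():
    """MTYPEs whose dirty L2 lines a buffer_wbl2 writes back."""
    return sorted(m for m, b in MTYPE_L2.items() if b.can_leave_dirty_lines)


def mtypes_invalidated_by_invl2():
    """MTYPEs whose L2 lines a buffer_invl2 invalidates."""
    return sorted(m for m, b in MTYPE_L2.items() if b.invalidated_by_invl2)
```

Under these assumptions only RW and NC lines are affected by ``buffer_wbl2``,
and only NC lines are affected by ``buffer_invl2``.
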
6385Scalar memory operations are only used to access memory that is proven to not
6386change during the execution of the kernel dispatch. This includes constant
6387address space and global address space for program scope ``const`` variables.
6388Therefore, the kernel machine code does not have to maintain the scalar cache to
6389ensure it is coherent with the vector caches. The scalar and vector caches are
6390invalidated between kernel dispatches by CP since constant address space data
6391may change between kernel dispatch executions. See
6392:ref:`amdgpu-amdhsa-memory-spaces`.
6393
6394The one exception is if scalar writes are used to spill SGPR registers. In this
6395case the AMDGPU backend ensures the memory location used to spill is never
6396accessed by vector memory operations at the same time. If scalar writes are used
6397then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6398return since the locations may be used for vector memory instructions by a
6399future wavefront that uses the same scratch area, or a function call that
6400creates a frame at the same address, respectively. There is no need for a
6401``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6402
6403For kernarg backing memory:
6404
6405* CP invalidates the L1 cache at the start of each kernel dispatch.
6406* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6407  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile, so it is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache-coherent) and
6411  so the L2 cache will be coherent with the CPU and other agents.
6412
6413Scratch backing memory (which is used for the private address space) is accessed
6414with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6415only accessed by a single thread, and is always write-before-read, there is
6416never a need to invalidate these entries from the L1 cache. Hence all cache
6417invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6418
6419The code sequences used to implement the memory model for GFX90A are defined
6420in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6421
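
As an illustration of how rows of that table are read, the following Python
sketch covers only the ``load atomic acquire`` rows with ``workgroup`` sync
scope. The function name and its parameters are hypothetical, and the returned
strings are descriptive labels taken from the table, not complete assembler
syntax (``global_load`` stands in for ``buffer/global_load``). It is not how
the backend is implemented.

```python
def load_acquire_workgroup(address_space, tgsplit, opencl=False):
    """Return the labeled code sequence for a workgroup-scope load acquire.

    Hypothetical helper that restates the table rows, nothing more.
    """
    if address_space == "global":
        if tgsplit:
            # Wavefronts may be on different CUs: bypass and invalidate L1.
            return ["global_load glc=1",
                    "s_waitcnt vmcnt(0)",
                    "buffer_wbinvl1_vol"]
        # Same CU, same L1: glc, the wait and the invalidate are omitted.
        return ["global_load"]
    if address_space == "local":
        if tgsplit:
            raise ValueError("local address space cannot be used in tgsplit mode")
        seq = ["ds_load"]
        if not opencl:
            seq.append("s_waitcnt lgkmcnt(0)")
        return seq
    if address_space == "generic":
        if tgsplit:
            return ["flat_load glc=1",
                    "s_waitcnt vmcnt(0)",
                    "buffer_wbinvl1_vol"]
        seq = ["flat_load"]
        if not opencl:
            seq.append("s_waitcnt lgkmcnt(0)")
        return seq
    raise ValueError("unsupported address space: " + address_space)
```

For example, ``load_acquire_workgroup("global", tgsplit=False)`` yields a single
load, while the same call with ``tgsplit=True`` adds the wait and the L1
invalidate.
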
6422  .. table:: AMDHSA Memory Model Code Sequences GFX90A
6423     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6424
6425     ============ ============ ============== ========== ================================
6426     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6427                  Ordering     Sync Scope     Address    GFX90A
6428                                              Space
6429     ============ ============ ============== ========== ================================
6430     **Non-Atomic**
6431     ------------------------------------------------------------------------------------
6432     load         *none*       *none*         - global   - !volatile & !nontemporal
6433                                              - generic
6434                                              - private    1. buffer/global/flat_load
6435                                              - constant
6436                                                         - !volatile & nontemporal
6437
6438                                                           1. buffer/global/flat_load
6439                                                              glc=1 slc=1
6440
6441                                                         - volatile
6442
6443                                                           1. buffer/global/flat_load
6444                                                              glc=1
6445                                                           2. s_waitcnt vmcnt(0)
6446
6447                                                            - Must happen before
6448                                                              any following volatile
6449                                                              global/generic
6450                                                              load/store.
6451                                                            - Ensures that
6452                                                              volatile
6453                                                              operations to
6454                                                              different
6455                                                              addresses will not
6456                                                              be reordered by
6457                                                              hardware.
6458
6459     load         *none*       *none*         - local    1. ds_load
6460     store        *none*       *none*         - global   - !volatile & !nontemporal
6461                                              - generic
6462                                              - private    1. buffer/global/flat_store
6463                                              - constant
6464                                                         - !volatile & nontemporal
6465
6466                                                           1. buffer/global/flat_store
6467                                                              glc=1 slc=1
6468
6469                                                         - volatile
6470
6471                                                           1. buffer/global/flat_store
6472                                                           2. s_waitcnt vmcnt(0)
6473
6474                                                            - Must happen before
6475                                                              any following volatile
6476                                                              global/generic
6477                                                              load/store.
6478                                                            - Ensures that
6479                                                              volatile
6480                                                              operations to
6481                                                              different
6482                                                              addresses will not
6483                                                              be reordered by
6484                                                              hardware.
6485
6486     store        *none*       *none*         - local    1. ds_store
6487     **Unordered Atomic**
6488     ------------------------------------------------------------------------------------
6489     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6490     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6491     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6492     **Monotonic Atomic**
6493     ------------------------------------------------------------------------------------
6494     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6495                               - wavefront    - generic
6496     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6497                                              - generic     glc=1
6498
6499                                                           - If not TgSplit execution
6500                                                             mode, omit glc=1.
6501
6502     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6503                               - wavefront               local address space cannot
6504                               - workgroup               be used.*
6505
6506                                                         1. ds_load
6507     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6508                                              - generic     glc=1
6509     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6510                                              - generic     glc=1
6511     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6512                               - wavefront    - generic
6513                               - workgroup
6514                               - agent
6515     store atomic monotonic    - system       - global   1. buffer/global/flat_store
6516                                              - generic
6517     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6518                               - wavefront               local address space cannot
6519                               - workgroup               be used.*
6520
6521                                                         1. ds_store
6522     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6523                               - wavefront    - generic
6524                               - workgroup
6525                               - agent
6526     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6527                                              - generic
6528     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6529                               - wavefront               local address space cannot
6530                               - workgroup               be used.*
6531
6532                                                         1. ds_atomic
6533     **Acquire Atomic**
6534     ------------------------------------------------------------------------------------
6535     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6536                               - wavefront    - local
6537                                              - generic
6538     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6539
6540                                                           - If not TgSplit execution
6541                                                             mode, omit glc=1.
6542
6543                                                         2. s_waitcnt vmcnt(0)
6544
6545                                                           - If not TgSplit execution
6546                                                             mode, omit.
6547                                                           - Must happen before the
6548                                                             following buffer_wbinvl1_vol.
6549
6550                                                         3. buffer_wbinvl1_vol
6551
6552                                                           - If not TgSplit execution
6553                                                             mode, omit.
6554                                                           - Must happen before
6555                                                             any following
6556                                                             global/generic
6557                                                             load/load
6558                                                             atomic/store/store
6559                                                             atomic/atomicrmw.
6560                                                           - Ensures that
6561                                                             following
6562                                                             loads will not see
6563                                                             stale data.
6564
6565     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6566                                                         local address space cannot
6567                                                         be used.*
6568
6569                                                         1. ds_load
6570                                                         2. s_waitcnt lgkmcnt(0)
6571
6572                                                           - If OpenCL, omit.
6573                                                           - Must happen before
6574                                                             any following
6575                                                             global/generic
6576                                                             load/load
6577                                                             atomic/store/store
6578                                                             atomic/atomicrmw.
6579                                                           - Ensures any
6580                                                             following global
6581                                                             data read is no
6582                                                             older than the local load
6583                                                             atomic value being
6584                                                             acquired.
6585
6586     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6587
6588                                                           - If not TgSplit execution
6589                                                             mode, omit glc=1.
6590
6591                                                         2. s_waitcnt lgkm/vmcnt(0)
6592
6593                                                           - Use lgkmcnt(0) if not
6594                                                             TgSplit execution mode
6595                                                             and vmcnt(0) if TgSplit
6596                                                             execution mode.
6597                                                           - If OpenCL, omit lgkmcnt(0).
6598                                                           - Must happen before
6599                                                             the following
6600                                                             buffer_wbinvl1_vol and any
6601                                                             following global/generic
6602                                                             load/load
6603                                                             atomic/store/store
6604                                                             atomic/atomicrmw.
6605                                                           - Ensures any
6606                                                             following global
6607                                                             data read is no
6608                                                             older than a local load
6609                                                             atomic value being
6610                                                             acquired.
6611
6612                                                         3. buffer_wbinvl1_vol
6613
6614                                                           - If not TgSplit execution
6615                                                             mode, omit.
6616                                                           - Ensures that
6617                                                             following
6618                                                             loads will not see
6619                                                             stale data.
6620
6621     load atomic  acquire      - agent        - global   1. buffer/global_load
6622                                                            glc=1
6623                                                         2. s_waitcnt vmcnt(0)
6624
6625                                                           - Must happen before
6626                                                             following
6627                                                             buffer_wbinvl1_vol.
6628                                                           - Ensures the load
6629                                                             has completed
6630                                                             before invalidating
6631                                                             the cache.
6632
6633                                                         3. buffer_wbinvl1_vol
6634
6635                                                           - Must happen before
6636                                                             any following
6637                                                             global/generic
6638                                                             load/load
6639                                                             atomic/atomicrmw.
6640                                                           - Ensures that
6641                                                             following
6642                                                             loads will not see
6643                                                             stale global data.
6644
6645     load atomic  acquire      - system       - global   1. buffer/global/flat_load
6646                                                            glc=1
6647                                                         2. s_waitcnt vmcnt(0)
6648
6649                                                           - Must happen before
6650                                                             following buffer_invl2 and
6651                                                             buffer_wbinvl1_vol.
6652                                                           - Ensures the load
6653                                                             has completed
6654                                                             before invalidating
6655                                                             the caches.
6656
6657                                                         3. buffer_invl2;
6658                                                            buffer_wbinvl1_vol
6659
6660                                                           - Must happen before
6661                                                             any following
6662                                                             global/generic
6663                                                             load/load
6664                                                             atomic/atomicrmw.
6665                                                           - Ensures that
6666                                                             following
6667                                                             loads will not see
6668                                                             stale L1 global data,
6669                                                             nor see stale L2 MTYPE
6670                                                             NC global data.
6671                                                             MTYPE RW and CC memory will
6672                                                             never be stale in L2 due to
6673                                                             the memory probes.
6674
6675     load atomic  acquire      - agent        - generic  1. flat_load glc=1
6676                                                         2. s_waitcnt vmcnt(0) &
6677                                                            lgkmcnt(0)
6678
6679                                                           - If TgSplit execution mode,
6680                                                             omit lgkmcnt(0).
6681                                                           - If OpenCL, omit
6682                                                             lgkmcnt(0).
6683                                                           - Must happen before
6684                                                             following
6685                                                             buffer_wbinvl1_vol.
6686                                                           - Ensures the flat_load
6687                                                             has completed
6688                                                             before invalidating
6689                                                             the cache.
6690
6691                                                         3. buffer_wbinvl1_vol
6692
6693                                                           - Must happen before
6694                                                             any following
6695                                                             global/generic
6696                                                             load/load
6697                                                             atomic/atomicrmw.
6698                                                           - Ensures that
6699                                                             following loads
6700                                                             will not see stale
6701                                                             global data.
6702
6703     load atomic  acquire      - system       - generic  1. flat_load glc=1
6704                                                         2. s_waitcnt vmcnt(0) &
6705                                                            lgkmcnt(0)
6706
6707                                                           - If TgSplit execution mode,
6708                                                             omit lgkmcnt(0).
6709                                                           - If OpenCL, omit
6710                                                             lgkmcnt(0).
6711                                                           - Must happen before
6712                                                             following
6713                                                             buffer_invl2 and
6714                                                             buffer_wbinvl1_vol.
6715                                                           - Ensures the flat_load
6716                                                             has completed
6717                                                             before invalidating
6718                                                             the caches.
6719
6720                                                         3. buffer_invl2;
6721                                                            buffer_wbinvl1_vol
6722
6723                                                           - Must happen before
6724                                                             any following
6725                                                             global/generic
6726                                                             load/load
6727                                                             atomic/atomicrmw.
6728                                                           - Ensures that
6729                                                             following
6730                                                             loads will not see
6731                                                             stale L1 global data,
6732                                                             nor see stale L2 MTYPE
6733                                                             NC global data.
6734                                                             MTYPE RW and CC memory will
6735                                                             never be stale in L2 due to
6736                                                             the memory probes.
6737
6738     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6739                               - wavefront    - generic
6740     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6741                               - wavefront               local address space cannot
6742                                                         be used.*
6743
6744                                                         1. ds_atomic
6745     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6746                                                         2. s_waitcnt vmcnt(0)
6747
6748                                                           - If not TgSplit execution
6749                                                             mode, omit.
6750                                                           - Must happen before the
6751                                                             following buffer_wbinvl1_vol.
6752                                                           - Ensures the atomicrmw
6753                                                             has completed
6754                                                             before invalidating
6755                                                             the cache.
6756
6757                                                         3. buffer_wbinvl1_vol
6758
6759                                                           - If not TgSplit execution
6760                                                             mode, omit.
6761                                                           - Must happen before
6762                                                             any following
6763                                                             global/generic
6764                                                             load/load
6765                                                             atomic/atomicrmw.
6766                                                           - Ensures that
6767                                                             following loads
6768                                                             will not see stale
6769                                                             global data.
6770
6771     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6772                                                         local address space cannot
6773                                                         be used.*
6774
6775                                                         1. ds_atomic
6776                                                         2. s_waitcnt lgkmcnt(0)
6777
6778                                                           - If OpenCL, omit.
6779                                                           - Must happen before
6780                                                             any following
6781                                                             global/generic
6782                                                             load/load
6783                                                             atomic/store/store
6784                                                             atomic/atomicrmw.
6785                                                           - Ensures any
6786                                                             following global
6787                                                             data read is no
6788                                                             older than the local
6789                                                             atomicrmw value
6790                                                             being acquired.
6791
6792     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6793                                                         2. s_waitcnt lgkm/vmcnt(0)
6794
6795                                                           - Use lgkmcnt(0) if not
6796                                                             TgSplit execution mode
6797                                                             and vmcnt(0) if TgSplit
6798                                                             execution mode.
6799                                                           - If OpenCL, omit lgkmcnt(0).
6800                                                           - Must happen before
6801                                                             the following
6802                                                             buffer_wbinvl1_vol and
6803                                                             any following
6804                                                             global/generic
6805                                                             load/load
6806                                                             atomic/store/store
6807                                                             atomic/atomicrmw.
6808                                                           - Ensures any
6809                                                             following global
6810                                                             data read is no
6811                                                             older than a local
6812                                                             atomicrmw value
6813                                                             being acquired.
6814
6815                                                         3. buffer_wbinvl1_vol
6816
6817                                                           - If not TgSplit execution
6818                                                             mode, omit.
6819                                                           - Ensures that
6820                                                             following
6821                                                             loads will not see
6822                                                             stale data.
6823
6824     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6825                                                         2. s_waitcnt vmcnt(0)
6826
6827                                                           - Must happen before
6828                                                             following
6829                                                             buffer_wbinvl1_vol.
6830                                                           - Ensures the
6831                                                             atomicrmw has
6832                                                             completed before
6833                                                             invalidating the
6834                                                             cache.
6835
6836                                                         3. buffer_wbinvl1_vol
6837
6838                                                           - Must happen before
6839                                                             any following
6840                                                             global/generic
6841                                                             load/load
6842                                                             atomic/atomicrmw.
6843                                                           - Ensures that
6844                                                             following loads
6845                                                             will not see stale
6846                                                             global data.
6847
6848     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
6849                                                         2. s_waitcnt vmcnt(0)
6850
6851                                                           - Must happen before
6852                                                             following buffer_invl2 and
6853                                                             buffer_wbinvl1_vol.
6854                                                           - Ensures the
6855                                                             atomicrmw has
6856                                                             completed before
6857                                                             invalidating the
6858                                                             caches.
6859
6860                                                         3. buffer_invl2;
6861                                                            buffer_wbinvl1_vol
6862
6863                                                           - Must happen before
6864                                                             any following
6865                                                             global/generic
6866                                                             load/load
6867                                                             atomic/atomicrmw.
6868                                                           - Ensures that
6869                                                             following
6870                                                             loads will not see
6871                                                             stale L1 global data,
6872                                                             nor see stale L2 MTYPE
6873                                                             NC global data.
6874                                                             MTYPE RW and CC memory will
6875                                                             never be stale in L2 due to
6876                                                             the memory probes.
6877
6878     atomicrmw    acquire      - agent        - generic  1. flat_atomic
6879                                                         2. s_waitcnt vmcnt(0) &
6880                                                            lgkmcnt(0)
6881
6882                                                           - If TgSplit execution mode,
6883                                                             omit lgkmcnt(0).
6884                                                           - If OpenCL, omit
6885                                                             lgkmcnt(0).
6886                                                           - Must happen before
6887                                                             following
6888                                                             buffer_wbinvl1_vol.
6889                                                           - Ensures the
6890                                                             atomicrmw has
6891                                                             completed before
6892                                                             invalidating the
6893                                                             cache.
6894
6895                                                         3. buffer_wbinvl1_vol
6896
6897                                                           - Must happen before
6898                                                             any following
6899                                                             global/generic
6900                                                             load/load
6901                                                             atomic/atomicrmw.
6902                                                           - Ensures that
6903                                                             following loads
6904                                                             will not see stale
6905                                                             global data.
6906
6907     atomicrmw    acquire      - system       - generic  1. flat_atomic
6908                                                         2. s_waitcnt vmcnt(0) &
6909                                                            lgkmcnt(0)
6910
6911                                                           - If TgSplit execution mode,
6912                                                             omit lgkmcnt(0).
6913                                                           - If OpenCL, omit
6914                                                             lgkmcnt(0).
6915                                                           - Must happen before
6916                                                             following
6917                                                             buffer_invl2 and
6918                                                             buffer_wbinvl1_vol.
6919                                                           - Ensures the
6920                                                             atomicrmw has
6921                                                             completed before
6922                                                             invalidating the
6923                                                             caches.
6924
6925                                                         3. buffer_invl2;
6926                                                            buffer_wbinvl1_vol
6927
6928                                                           - Must happen before
6929                                                             any following
6930                                                             global/generic
6931                                                             load/load
6932                                                             atomic/atomicrmw.
6933                                                           - Ensures that
6934                                                             following
6935                                                             loads will not see
6936                                                             stale L1 global data,
6937                                                             nor see stale L2 MTYPE
6938                                                             NC global data.
6939                                                             MTYPE RW and CC memory will
6940                                                             never be stale in L2 due to
6941                                                             the memory probes.
6942
6943     fence        acquire      - singlethread *none*     *none*
6944                               - wavefront
6945     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
6946
6947                                                           - Use lgkmcnt(0) if not
6948                                                             TgSplit execution mode
6949                                                             and vmcnt(0) if TgSplit
6950                                                             execution mode.
6951                                                           - If OpenCL and
6952                                                             address space is
6953                                                             not generic, omit
6954                                                             lgkmcnt(0).
6955                                                           - If OpenCL and
6956                                                             address space is
6957                                                             local, omit
6958                                                             vmcnt(0).
6959                                                           - However, since LLVM
6960                                                             currently has no
6961                                                             address space on
6962                                                             the fence, the wait
6963                                                             must conservatively
6964                                                             always be generated. If
6965                                                             the fence had an
6966                                                             address space, it would
6967                                                             be set to the address
6968                                                             space of OpenCL
6969                                                             fence flag, or to
6970                                                             generic if both
6971                                                             local and global
6972                                                             flags are
6973                                                             specified.
6974                                                           - s_waitcnt vmcnt(0)
6975                                                             must happen after
6976                                                             any preceding
6977                                                             global/generic load
6978                                                             atomic/
6979                                                             atomicrmw
6980                                                             with an equal or
6981                                                             wider sync scope
6982                                                             and memory ordering
6983                                                             stronger than
6984                                                             unordered (this is
6985                                                             termed the
6986                                                             fence-paired-atomic).
6987                                                           - s_waitcnt lgkmcnt(0)
6988                                                             must happen after
6989                                                             any preceding
6990                                                             local/generic load
6991                                                             atomic/atomicrmw
6992                                                             with an equal or
6993                                                             wider sync scope
6994                                                             and memory ordering
6995                                                             stronger than
6996                                                             unordered (this is
6997                                                             termed the
6998                                                             fence-paired-atomic).
6999                                                           - Must happen before
7000                                                             the following
7001                                                             buffer_wbinvl1_vol and
7002                                                             any following
7003                                                             global/generic
7004                                                             load/load
7005                                                             atomic/store/store
7006                                                             atomic/atomicrmw.
7007                                                           - Ensures any
7008                                                             following global
7009                                                             data read is no
7010                                                             older than the
7011                                                             value read by the
7012                                                             fence-paired-atomic.
7013
7014                                                         2. buffer_wbinvl1_vol
7015
7016                                                           - If not TgSplit execution
7017                                                             mode, omit.
7018                                                           - Ensures that
7019                                                             following
7020                                                             loads will not see
7021                                                             stale data.
7022
7023     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7024                                                            vmcnt(0)
7025
7026                                                           - If TgSplit execution mode,
7027                                                             omit lgkmcnt(0).
7028                                                           - If OpenCL and
7029                                                             address space is
7030                                                             not generic, omit
7031                                                             lgkmcnt(0).
7032                                                           - However, since LLVM
7033                                                             currently has no
7034                                                             address space on
7035                                                             the fence, the wait
7036                                                             must conservatively
7037                                                             always be generated
7038                                                             (see comment for
7039                                                             previous fence).
7040                                                           - Could be split into
7041                                                             separate s_waitcnt
7042                                                             vmcnt(0) and
7043                                                             s_waitcnt
7044                                                             lgkmcnt(0) to allow
7045                                                             them to be
7046                                                             independently moved
7047                                                             according to the
7048                                                             following rules.
7049                                                           - s_waitcnt vmcnt(0)
7050                                                             must happen after
7051                                                             any preceding
7052                                                             global/generic load
7053                                                             atomic/atomicrmw
7054                                                             with an equal or
7055                                                             wider sync scope
7056                                                             and memory ordering
7057                                                             stronger than
7058                                                             unordered (this is
7059                                                             termed the
7060                                                             fence-paired-atomic).
7061                                                           - s_waitcnt lgkmcnt(0)
7062                                                             must happen after
7063                                                             any preceding
7064                                                             local/generic load
7065                                                             atomic/atomicrmw
7066                                                             with an equal or
7067                                                             wider sync scope
7068                                                             and memory ordering
7069                                                             stronger than
7070                                                             unordered (this is
7071                                                             termed the
7072                                                             fence-paired-atomic).
7073                                                           - Must happen before
7074                                                             the following
7075                                                             buffer_wbinvl1_vol.
7076                                                           - Ensures that the
7077                                                             fence-paired-atomic
7078                                                             has completed
7079                                                             before invalidating
7080                                                             the
7081                                                             cache. Therefore
7082                                                             any following
7083                                                             locations read must
7084                                                             be no older than
7085                                                             the value read by
7086                                                             the
7087                                                             fence-paired-atomic.
7088
7089                                                         2. buffer_wbinvl1_vol
7090
7091                                                           - Must happen before any
7092                                                             following global/generic
7093                                                             load/load
7094                                                             atomic/store/store
7095                                                             atomic/atomicrmw.
7096                                                           - Ensures that
7097                                                             following loads
7098                                                             will not see stale
7099                                                             global data.
7100
7101     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
7102                                                            vmcnt(0)
7103
7104                                                           - If TgSplit execution mode,
7105                                                             omit lgkmcnt(0).
7106                                                           - If OpenCL and
7107                                                             address space is
7108                                                             not generic, omit
7109                                                             lgkmcnt(0).
7110                                                           - However, since LLVM
7111                                                             currently has no
7112                                                             address space on
7113                                                             the fence, lgkmcnt(0)
7114                                                             must conservatively
7115                                                             always be generated
7116                                                             (see comment for
7117                                                             previous fence).
7118                                                           - Could be split into
7119                                                             separate s_waitcnt
7120                                                             vmcnt(0) and
7121                                                             s_waitcnt
7122                                                             lgkmcnt(0) to allow
7123                                                             them to be
7124                                                             independently moved
7125                                                             according to the
7126                                                             following rules.
7127                                                           - s_waitcnt vmcnt(0)
7128                                                             must happen after
7129                                                             any preceding
7130                                                             global/generic load
7131                                                             atomic/atomicrmw
7132                                                             with an equal or
7133                                                             wider sync scope
7134                                                             and memory ordering
7135                                                             stronger than
7136                                                             unordered (this is
7137                                                             termed the
7138                                                             fence-paired-atomic).
7139                                                           - s_waitcnt lgkmcnt(0)
7140                                                             must happen after
7141                                                             any preceding
7142                                                             local/generic load
7143                                                             atomic/atomicrmw
7144                                                             with an equal or
7145                                                             wider sync scope
7146                                                             and memory ordering
7147                                                             stronger than
7148                                                             unordered (this is
7149                                                             termed the
7150                                                             fence-paired-atomic).
7151                                                           - Must happen before
7152                                                             the following buffer_invl2 and
7153                                                             buffer_wbinvl1_vol.
7154                                                           - Ensures that the
7155                                                             fence-paired-atomic
7156                                                             has completed
7157                                                             before invalidating
7158                                                             the
7159                                                             cache. Therefore
7160                                                             any following
7161                                                             locations read must
7162                                                             be no older than
7163                                                             the value read by
7164                                                             the
7165                                                             fence-paired-atomic.
7166
7167                                                         2. buffer_invl2;
7168                                                            buffer_wbinvl1_vol
7169
7170                                                           - Must happen before any
7171                                                             following global/generic
7172                                                             load/load
7173                                                             atomic/store/store
7174                                                             atomic/atomicrmw.
7175                                                           - Ensures that
7176                                                             following
7177                                                             loads will not see
7178                                                             stale L1 global data,
7179                                                             nor see stale L2 MTYPE
7180                                                             NC global data.
7181                                                             MTYPE RW and CC memory will
7182                                                             never be stale in L2 due to
7183                                                             the memory probes.
7184     **Release Atomic**
7185     ------------------------------------------------------------------------------------
7186     store atomic release      - singlethread - global   1. buffer/global/flat_store
7187                               - wavefront    - generic
7188     store atomic release      - singlethread - local    *If TgSplit execution mode,
7189                               - wavefront               local address space cannot
7190                                                         be used.*
7191
7192                                                         1. ds_store
7193     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7194                                              - generic
7195                                                           - Use lgkmcnt(0) if not
7196                                                             TgSplit execution mode
7197                                                             and vmcnt(0) if TgSplit
7198                                                             execution mode.
7199                                                           - If OpenCL, omit lgkmcnt(0).
7200                                                           - s_waitcnt vmcnt(0)
7201                                                             must happen after
7202                                                             any preceding
7203                                                             global/generic load/store/
7204                                                             load atomic/store atomic/
7205                                                             atomicrmw.
7206                                                           - s_waitcnt lgkmcnt(0)
7207                                                             must happen after
7208                                                             any preceding
7209                                                             local/generic
7210                                                             load/store/load
7211                                                             atomic/store
7212                                                             atomic/atomicrmw.
7213                                                           - Must happen before
7214                                                             the following
7215                                                             store.
7216                                                           - Ensures that all
7217                                                             memory operations
7218                                                             have
7219                                                             completed before
7220                                                             performing the
7221                                                             store that is being
7222                                                             released.
7223
7224                                                         2. buffer/global/flat_store
7225     store atomic release      - workgroup    - local    *If TgSplit execution mode,
7226                                                         local address space cannot
7227                                                         be used.*
7228
7229                                                         1. ds_store
7230     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7231                                              - generic     vmcnt(0)
7232
7233                                                           - If TgSplit execution mode,
7234                                                             omit lgkmcnt(0).
7235                                                           - If OpenCL and
7236                                                             address space is
7237                                                             not generic, omit
7238                                                             lgkmcnt(0).
7239                                                           - Could be split into
7240                                                             separate s_waitcnt
7241                                                             vmcnt(0) and
7242                                                             s_waitcnt
7243                                                             lgkmcnt(0) to allow
7244                                                             them to be
7245                                                             independently moved
7246                                                             according to the
7247                                                             following rules.
7248                                                           - s_waitcnt vmcnt(0)
7249                                                             must happen after
7250                                                             any preceding
7251                                                             global/generic
7252                                                             load/store/load
7253                                                             atomic/store
7254                                                             atomic/atomicrmw.
7255                                                           - s_waitcnt lgkmcnt(0)
7256                                                             must happen after
7257                                                             any preceding
7258                                                             local/generic
7259                                                             load/store/load
7260                                                             atomic/store
7261                                                             atomic/atomicrmw.
7262                                                           - Must happen before
7263                                                             the following
7264                                                             store.
7265                                                           - Ensures that all
7266                                                             memory operations
7267                                                             have
7268                                                             completed before
7269                                                             performing the
7270                                                             store that is being
7271                                                             released.
7272
7273                                                         2. buffer/global/flat_store
7274     store atomic release      - system       - global   1. buffer_wbl2
7275                                              - generic
7276                                                           - Must happen before
7277                                                             following s_waitcnt.
7278                                                           - Performs L2 writeback to
7279                                                             ensure previous
7280                                                             global/generic
7281                                                             store/atomicrmw are
7282                                                             visible at system scope.
7283
7284                                                         2. s_waitcnt lgkmcnt(0) &
7285                                                            vmcnt(0)
7286
7287                                                           - If TgSplit execution mode,
7288                                                             omit lgkmcnt(0).
7289                                                           - If OpenCL and
7290                                                             address space is
7291                                                             not generic, omit
7292                                                             lgkmcnt(0).
7293                                                           - Could be split into
7294                                                             separate s_waitcnt
7295                                                             vmcnt(0) and
7296                                                             s_waitcnt
7297                                                             lgkmcnt(0) to allow
7298                                                             them to be
7299                                                             independently moved
7300                                                             according to the
7301                                                             following rules.
7302                                                           - s_waitcnt vmcnt(0)
7303                                                             must happen after any
7304                                                             preceding
7305                                                             global/generic
7306                                                             load/store/load
7307                                                             atomic/store
7308                                                             atomic/atomicrmw.
7309                                                           - s_waitcnt lgkmcnt(0)
7310                                                             must happen after any
7311                                                             preceding
7312                                                             local/generic
7313                                                             load/store/load
7314                                                             atomic/store
7315                                                             atomic/atomicrmw.
7316                                                           - Must happen before
7317                                                             the following
7318                                                             store.
7319                                                           - Ensures that all
7320                                                             memory operations
7321                                                             to memory and the L2
7322                                                             writeback have
7323                                                             completed before
7324                                                             performing the
7325                                                             store that is being
7326                                                             released.
7327
7328                                                         3. buffer/global/flat_store
7329     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7330                               - wavefront    - generic
7331     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7332                               - wavefront               local address space cannot
7333                                                         be used.*
7334
7335                                                         1. ds_atomic
7336     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7337                                              - generic
7338                                                           - Use lgkmcnt(0) if not
7339                                                             TgSplit execution mode
7340                                                             and vmcnt(0) if TgSplit
7341                                                             execution mode.
7342                                                           - If OpenCL, omit
7343                                                             lgkmcnt(0).
7344                                                           - s_waitcnt vmcnt(0)
7345                                                             must happen after
7346                                                             any preceding
7347                                                             global/generic load/store/
7348                                                             load atomic/store atomic/
7349                                                             atomicrmw.
7350                                                           - s_waitcnt lgkmcnt(0)
7351                                                             must happen after
7352                                                             any preceding
7353                                                             local/generic
7354                                                             load/store/load
7355                                                             atomic/store
7356                                                             atomic/atomicrmw.
7357                                                           - Must happen before
7358                                                             the following
7359                                                             atomicrmw.
7360                                                           - Ensures that all
7361                                                             memory operations
7362                                                             have
7363                                                             completed before
7364                                                             performing the
7365                                                             atomicrmw that is
7366                                                             being released.
7367
7368                                                         2. buffer/global/flat_atomic
7369     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7370                                                         local address space cannot
7371                                                         be used.*
7372
7373                                                         1. ds_atomic
7374     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7375                                              - generic     vmcnt(0)
7376
7377                                                           - If TgSplit execution mode,
7378                                                             omit lgkmcnt(0).
7379                                                           - If OpenCL, omit
7380                                                             lgkmcnt(0).
7381                                                           - Could be split into
7382                                                             separate s_waitcnt
7383                                                             vmcnt(0) and
7384                                                             s_waitcnt
7385                                                             lgkmcnt(0) to allow
7386                                                             them to be
7387                                                             independently moved
7388                                                             according to the
7389                                                             following rules.
7390                                                           - s_waitcnt vmcnt(0)
7391                                                             must happen after
7392                                                             any preceding
7393                                                             global/generic
7394                                                             load/store/load
7395                                                             atomic/store
7396                                                             atomic/atomicrmw.
7397                                                           - s_waitcnt lgkmcnt(0)
7398                                                             must happen after
7399                                                             any preceding
7400                                                             local/generic
7401                                                             load/store/load
7402                                                             atomic/store
7403                                                             atomic/atomicrmw.
7404                                                           - Must happen before
7405                                                             the following
7406                                                             atomicrmw.
7407                                                           - Ensures that all
7408                                                             memory operations
7409                                                             to global and local
7410                                                             have completed
7411                                                             before performing
7412                                                             the atomicrmw that
7413                                                             is being released.
7414
7415                                                         2. buffer/global/flat_atomic
7416     atomicrmw    release      - system       - global   1. buffer_wbl2
7417                                              - generic
7418                                                           - Must happen before
7419                                                             following s_waitcnt.
7420                                                           - Performs L2 writeback to
7421                                                             ensure previous
7422                                                             global/generic
7423                                                             store/atomicrmw are
7424                                                             visible at system scope.
7425
7426                                                         2. s_waitcnt lgkmcnt(0) &
7427                                                            vmcnt(0)
7428
7429                                                           - If TgSplit execution mode,
7430                                                             omit lgkmcnt(0).
7431                                                           - If OpenCL, omit
7432                                                             lgkmcnt(0).
7433                                                           - Could be split into
7434                                                             separate s_waitcnt
7435                                                             vmcnt(0) and
7436                                                             s_waitcnt
7437                                                             lgkmcnt(0) to allow
7438                                                             them to be
7439                                                             independently moved
7440                                                             according to the
7441                                                             following rules.
7442                                                           - s_waitcnt vmcnt(0)
7443                                                             must happen after
7444                                                             any preceding
7445                                                             global/generic
7446                                                             load/store/load
7447                                                             atomic/store
7448                                                             atomic/atomicrmw.
7449                                                           - s_waitcnt lgkmcnt(0)
7450                                                             must happen after
7451                                                             any preceding
7452                                                             local/generic
7453                                                             load/store/load
7454                                                             atomic/store
7455                                                             atomic/atomicrmw.
7456                                                           - Must happen before
7457                                                             the following
7458                                                             atomicrmw.
7459                                                           - Ensures that all
7460                                                             memory operations
7461                                                             to memory and the L2
7462                                                             writeback have
7463                                                             completed before
7464                                                             performing the
7465                                                             atomicrmw that is
7466                                                             being released.
7467
7468                                                         3. buffer/global/flat_atomic
7469     fence        release      - singlethread *none*     *none*
7470                               - wavefront
7471     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7472
7473                                                           - Use lgkmcnt(0) if not
7474                                                             TgSplit execution mode
7475                                                             and vmcnt(0) if TgSplit
7476                                                             execution mode.
7477                                                           - If OpenCL and
7478                                                             address space is
7479                                                             not generic, omit
7480                                                             lgkmcnt(0).
7481                                                           - If OpenCL and
7482                                                             address space is
7483                                                             local, omit
7484                                                             vmcnt(0).
7485                                                           - However, since LLVM
7486                                                             currently has no
7487                                                             address space on
7488                                                             the fence, the wait
7489                                                             must conservatively
7490                                                             always be generated.
7491                                                             If the fence had an
7492                                                             address space, it
7493                                                             would be set to the
7494                                                             address space of the
7495                                                             OpenCL fence flag, or
7496                                                             to generic if both
7497                                                             local and global
7498                                                             flags are
7499                                                             specified.
7500                                                           - s_waitcnt vmcnt(0)
7501                                                             must happen after
7502                                                             any preceding
7503                                                             global/generic
7504                                                             load/store/
7505                                                             load atomic/store atomic/
7506                                                             atomicrmw.
7507                                                           - s_waitcnt lgkmcnt(0)
7508                                                             must happen after
7509                                                             any preceding
7510                                                             local/generic
7511                                                             load/load
7512                                                             atomic/store/store
7513                                                             atomic/atomicrmw.
7514                                                           - Must happen before
7515                                                             any following store
7516                                                             atomic/atomicrmw
7517                                                             with an equal or
7518                                                             wider sync scope
7519                                                             and memory ordering
7520                                                             stronger than
7521                                                             unordered (this is
7522                                                             termed the
7523                                                             fence-paired-atomic).
7524                                                           - Ensures that all
7525                                                             memory operations
7526                                                             have
7527                                                             completed before
7528                                                             performing the
7529                                                             following
7530                                                             fence-paired-atomic.
7531
7532     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7533                                                            vmcnt(0)
7534
7535                                                           - If TgSplit execution mode,
7536                                                             omit lgkmcnt(0).
7537                                                           - If OpenCL and
7538                                                             address space is
7539                                                             not generic, omit
7540                                                             lgkmcnt(0).
7541                                                           - If OpenCL and
7542                                                             address space is
7543                                                             local, omit
7544                                                             vmcnt(0).
7545                                                           - However, since LLVM
7546                                                             currently has no
7547                                                             address space on
7548                                                             the fence, the wait
7549                                                             must conservatively
7550                                                             always be generated.
7551                                                             If the fence had an
7552                                                             address space, it
7553                                                             would be set to the
7554                                                             address space of the
7555                                                             OpenCL fence flag, or
7556                                                             to generic if both
7557                                                             local and global
7558                                                             flags are
7559                                                             specified.
7560                                                           - Could be split into
7561                                                             separate s_waitcnt
7562                                                             vmcnt(0) and
7563                                                             s_waitcnt
7564                                                             lgkmcnt(0) to allow
7565                                                             them to be
7566                                                             independently moved
7567                                                             according to the
7568                                                             following rules.
7569                                                           - s_waitcnt vmcnt(0)
7570                                                             must happen after
7571                                                             any preceding
7572                                                             global/generic
7573                                                             load/store/load
7574                                                             atomic/store
7575                                                             atomic/atomicrmw.
7576                                                           - s_waitcnt lgkmcnt(0)
7577                                                             must happen after
7578                                                             any preceding
7579                                                             local/generic
7580                                                             load/store/load
7581                                                             atomic/store
7582                                                             atomic/atomicrmw.
7583                                                           - Must happen before
7584                                                             any following store
7585                                                             atomic/atomicrmw
7586                                                             with an equal or
7587                                                             wider sync scope
7588                                                             and memory ordering
7589                                                             stronger than
7590                                                             unordered (this is
7591                                                             termed the
7592                                                             fence-paired-atomic).
7593                                                           - Ensures that all
7594                                                             memory operations
7595                                                             have
7596                                                             completed before
7597                                                             performing the
7598                                                             following
7599                                                             fence-paired-atomic.
7600
7601     fence        release      - system       *none*     1. buffer_wbl2
7602
7603                                                           - If OpenCL and
7604                                                             address space is
7605                                                             local, omit.
7606                                                           - Must happen before
7607                                                             following s_waitcnt.
7608                                                           - Performs L2 writeback to
7609                                                             ensure previous
7610                                                             global/generic
7611                                                             store/atomicrmw are
7612                                                             visible at system scope.
7613
7614                                                         2. s_waitcnt lgkmcnt(0) &
7615                                                            vmcnt(0)
7616
7617                                                           - If TgSplit execution mode,
7618                                                             omit lgkmcnt(0).
7619                                                           - If OpenCL and
7620                                                             address space is
7621                                                             not generic, omit
7622                                                             lgkmcnt(0).
7623                                                           - If OpenCL and
7624                                                             address space is
7625                                                             local, omit
7626                                                             vmcnt(0).
7627                                                           - However, since LLVM
7628                                                             currently has no
7629                                                             address space on
7630                                                             the fence, the wait
7631                                                             must conservatively
7632                                                             always be generated.
7633                                                             If the fence had an
7634                                                             address space, it
7635                                                             would be set to the
7636                                                             address space of the
7637                                                             OpenCL fence flag, or
7638                                                             to generic if both
7639                                                             local and global
7640                                                             flags are
7641                                                             specified.
7642                                                           - Could be split into
7643                                                             separate s_waitcnt
7644                                                             vmcnt(0) and
7645                                                             s_waitcnt
7646                                                             lgkmcnt(0) to allow
7647                                                             them to be
7648                                                             independently moved
7649                                                             according to the
7650                                                             following rules.
7651                                                           - s_waitcnt vmcnt(0)
7652                                                             must happen after
7653                                                             any preceding
7654                                                             global/generic
7655                                                             load/store/load
7656                                                             atomic/store
7657                                                             atomic/atomicrmw.
7658                                                           - s_waitcnt lgkmcnt(0)
7659                                                             must happen after
7660                                                             any preceding
7661                                                             local/generic
7662                                                             load/store/load
7663                                                             atomic/store
7664                                                             atomic/atomicrmw.
7665                                                           - Must happen before
7666                                                             any following store
7667                                                             atomic/atomicrmw
7668                                                             with an equal or
7669                                                             wider sync scope
7670                                                             and memory ordering
7671                                                             stronger than
7672                                                             unordered (this is
7673                                                             termed the
7674                                                             fence-paired-atomic).
7675                                                           - Ensures that all
7676                                                             memory operations
7677                                                             have
7678                                                             completed before
7679                                                             performing the
7680                                                             following
7681                                                             fence-paired-atomic.
7682
7683     **Acquire-Release Atomic**
7684     ------------------------------------------------------------------------------------
7685     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7686                               - wavefront    - generic
7687     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7688                               - wavefront               local address space cannot
7689                                                         be used.*
7690
7691                                                         1. ds_atomic
7692     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7693
7694                                                           - Use lgkmcnt(0) if not
7695                                                             TgSplit execution mode
7696                                                             and vmcnt(0) if TgSplit
7697                                                             execution mode.
7698                                                           - If OpenCL, omit
7699                                                             lgkmcnt(0).
7700                                                           - Must happen after
7701                                                             any preceding
7702                                                             local/generic
7703                                                             load/store/load
7704                                                             atomic/store
7705                                                             atomic/atomicrmw.
7706                                                           - s_waitcnt vmcnt(0)
7707                                                             must happen after
7708                                                             any preceding
7709                                                             global/generic load/store/
7710                                                             load atomic/store atomic/
7711                                                             atomicrmw.
7712                                                           - s_waitcnt lgkmcnt(0)
7713                                                             must happen after
7714                                                             any preceding
7715                                                             local/generic
7716                                                             load/store/load
7717                                                             atomic/store
7718                                                             atomic/atomicrmw.
7719                                                           - Must happen before
7720                                                             the following
7721                                                             atomicrmw.
7722                                                           - Ensures that all
7723                                                             memory operations
7724                                                             have
7725                                                             completed before
7726                                                             performing the
7727                                                             atomicrmw that is
7728                                                             being released.
7729
7730                                                         2. buffer/global_atomic
7731                                                         3. s_waitcnt vmcnt(0)
7732
7733                                                           - If not TgSplit execution
7734                                                             mode, omit.
7735                                                           - Must happen before
7736                                                             the following
7737                                                             buffer_wbinvl1_vol.
7738                                                           - Ensures any
7739                                                             following global
7740                                                             data read is no
7741                                                             older than the
7742                                                             atomicrmw value
7743                                                             being acquired.
7744
7745                                                         4. buffer_wbinvl1_vol
7746
7747                                                           - If not TgSplit execution
7748                                                             mode, omit.
7749                                                           - Ensures that
7750                                                             following
7751                                                             loads will not see
7752                                                             stale data.
7753
7754     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
7755                                                         local address space cannot
7756                                                         be used.*
7757
7758                                                         1. ds_atomic
7759                                                         2. s_waitcnt lgkmcnt(0)
7760
7761                                                           - If OpenCL, omit.
7762                                                           - Must happen before
7763                                                             any following
7764                                                             global/generic
7765                                                             load/load
7766                                                             atomic/store/store
7767                                                             atomic/atomicrmw.
7768                                                           - Ensures any
7769                                                             following global
7770                                                             data read is no
7771                                                             older than the local
7772                                                             atomicrmw value being
7773                                                             acquired.
7774
7775     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
7776
7777                                                           - Use lgkmcnt(0) if not
7778                                                             TgSplit execution mode
7779                                                             and vmcnt(0) if TgSplit
7780                                                             execution mode.
7781                                                           - If OpenCL, omit
7782                                                             lgkmcnt(0).
7783                                                           - s_waitcnt vmcnt(0)
7784                                                             must happen after
7785                                                             any preceding
7786                                                             global/generic load/store/
7787                                                             load atomic/store atomic/
7788                                                             atomicrmw.
7789                                                           - s_waitcnt lgkmcnt(0)
7790                                                             must happen after
7791                                                             any preceding
7792                                                             local/generic
7793                                                             load/store/load
7794                                                             atomic/store
7795                                                             atomic/atomicrmw.
7796                                                           - Must happen before
7797                                                             the following
7798                                                             atomicrmw.
7799                                                           - Ensures that all
7800                                                             memory operations
7801                                                             have
7802                                                             completed before
7803                                                             performing the
7804                                                             atomicrmw that is
7805                                                             being released.
7806
7807                                                         2. flat_atomic
7808                                                         3. s_waitcnt lgkmcnt(0) &
7809                                                            vmcnt(0)
7810
7811                                                           - If not TgSplit execution
7812                                                             mode, omit vmcnt(0).
7813                                                           - If OpenCL, omit
7814                                                             lgkmcnt(0).
7815                                                           - Must happen before
7816                                                             the following
7817                                                             buffer_wbinvl1_vol and
7818                                                             any following
7819                                                             global/generic
7820                                                             load/load
7821                                                             atomic/store/store
7822                                                             atomic/atomicrmw.
7823                                                           - Ensures any
7824                                                             following global
7825                                                             data read is no
7826                                                             older than a local
7827                                                             atomicrmw value being
7828                                                             acquired.
7829
7830                                                         4. buffer_wbinvl1_vol
7831
7832                                                           - If not TgSplit execution
7833                                                             mode, omit.
7834                                                           - Ensures that
7835                                                             following
7836                                                             loads will not see
7837                                                             stale data.
7838
7839     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7840                                                            vmcnt(0)
7841
7842                                                           - If TgSplit execution mode,
7843                                                             omit lgkmcnt(0).
7844                                                           - If OpenCL, omit
7845                                                             lgkmcnt(0).
7846                                                           - Could be split into
7847                                                             separate s_waitcnt
7848                                                             vmcnt(0) and
7849                                                             s_waitcnt
7850                                                             lgkmcnt(0) to allow
7851                                                             them to be
7852                                                             independently moved
7853                                                             according to the
7854                                                             following rules.
7855                                                           - s_waitcnt vmcnt(0)
7856                                                             must happen after
7857                                                             any preceding
7858                                                             global/generic
7859                                                             load/store/load
7860                                                             atomic/store
7861                                                             atomic/atomicrmw.
7862                                                           - s_waitcnt lgkmcnt(0)
7863                                                             must happen after
7864                                                             any preceding
7865                                                             local/generic
7866                                                             load/store/load
7867                                                             atomic/store
7868                                                             atomic/atomicrmw.
7869                                                           - Must happen before
7870                                                             the following
7871                                                             atomicrmw.
7872                                                           - Ensures that all
7873                                                             memory operations
7874                                                             to global have
7875                                                             completed before
7876                                                             performing the
7877                                                             atomicrmw that is
7878                                                             being released.
7879
7880                                                         2. buffer/global_atomic
7881                                                         3. s_waitcnt vmcnt(0)
7882
7883                                                           - Must happen before
7884                                                             following
7885                                                             buffer_wbinvl1_vol.
7886                                                           - Ensures the
7887                                                             atomicrmw has
7888                                                             completed before
7889                                                             invalidating the
7890                                                             cache.
7891
7892                                                         4. buffer_wbinvl1_vol
7893
7894                                                           - Must happen before
7895                                                             any following
7896                                                             global/generic
7897                                                             load/load
7898                                                             atomic/atomicrmw.
7899                                                           - Ensures that
7900                                                             following loads
7901                                                             will not see stale
7902                                                             global data.
7903
7904     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
7905
7906                                                           - Must happen before
7907                                                             following s_waitcnt.
7908                                                           - Performs L2 writeback to
7909                                                             ensure previous
7910                                                             global/generic
7911                                                             store/atomicrmw are
7912                                                             visible at system scope.
7913
7914                                                         2. s_waitcnt lgkmcnt(0) &
7915                                                            vmcnt(0)
7916
7917                                                           - If TgSplit execution mode,
7918                                                             omit lgkmcnt(0).
7919                                                           - If OpenCL, omit
7920                                                             lgkmcnt(0).
7921                                                           - Could be split into
7922                                                             separate s_waitcnt
7923                                                             vmcnt(0) and
7924                                                             s_waitcnt
7925                                                             lgkmcnt(0) to allow
7926                                                             them to be
7927                                                             independently moved
7928                                                             according to the
7929                                                             following rules.
7930                                                           - s_waitcnt vmcnt(0)
7931                                                             must happen after
7932                                                             any preceding
7933                                                             global/generic
7934                                                             load/store/load
7935                                                             atomic/store
7936                                                             atomic/atomicrmw.
7937                                                           - s_waitcnt lgkmcnt(0)
7938                                                             must happen after
7939                                                             any preceding
7940                                                             local/generic
7941                                                             load/store/load
7942                                                             atomic/store
7943                                                             atomic/atomicrmw.
7944                                                           - Must happen before
7945                                                             the following
7946                                                             atomicrmw.
7947                                                           - Ensures that all
7948                                                             memory operations
7949                                                             to global and L2 writeback
7950                                                             have completed before
7951                                                             performing the
7952                                                             atomicrmw that is
7953                                                             being released.
7954
7955                                                         3. buffer/global_atomic
7956                                                         4. s_waitcnt vmcnt(0)
7957
7958                                                           - Must happen before
7959                                                             following buffer_invl2 and
7960                                                             buffer_wbinvl1_vol.
7961                                                           - Ensures the
7962                                                             atomicrmw has
7963                                                             completed before
7964                                                             invalidating the
7965                                                             caches.
7966
7967                                                         5. buffer_invl2;
7968                                                            buffer_wbinvl1_vol
7969
7970                                                           - Must happen before
7971                                                             any following
7972                                                             global/generic
7973                                                             load/load
7974                                                             atomic/atomicrmw.
7975                                                           - Ensures that
7976                                                             following
7977                                                             loads will not see
7978                                                             stale L1 global data,
7979                                                             nor see stale L2 MTYPE
7980                                                             NC global data.
7981                                                             MTYPE RW and CC memory will
7982                                                             never be stale in L2 due to
7983                                                             the memory probes.
7984
7985     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
7986                                                            vmcnt(0)
7987
7988                                                           - If TgSplit execution mode,
7989                                                             omit lgkmcnt(0).
7990                                                           - If OpenCL, omit
7991                                                             lgkmcnt(0).
7992                                                           - Could be split into
7993                                                             separate s_waitcnt
7994                                                             vmcnt(0) and
7995                                                             s_waitcnt
7996                                                             lgkmcnt(0) to allow
7997                                                             them to be
7998                                                             independently moved
7999                                                             according to the
8000                                                             following rules.
8001                                                           - s_waitcnt vmcnt(0)
8002                                                             must happen after
8003                                                             any preceding
8004                                                             global/generic
8005                                                             load/store/load
8006                                                             atomic/store
8007                                                             atomic/atomicrmw.
8008                                                           - s_waitcnt lgkmcnt(0)
8009                                                             must happen after
8010                                                             any preceding
8011                                                             local/generic
8012                                                             load/store/load
8013                                                             atomic/store
8014                                                             atomic/atomicrmw.
8015                                                           - Must happen before
8016                                                             the following
8017                                                             atomicrmw.
8018                                                           - Ensures that all
8019                                                             memory operations
8020                                                             to global have
8021                                                             completed before
8022                                                             performing the
8023                                                             atomicrmw that is
8024                                                             being released.
8025
8026                                                         2. flat_atomic
8027                                                         3. s_waitcnt vmcnt(0) &
8028                                                            lgkmcnt(0)
8029
8030                                                           - If TgSplit execution mode,
8031                                                             omit lgkmcnt(0).
8032                                                           - If OpenCL, omit
8033                                                             lgkmcnt(0).
8034                                                           - Must happen before
8035                                                             following
8036                                                             buffer_wbinvl1_vol.
8037                                                           - Ensures the
8038                                                             atomicrmw has
8039                                                             completed before
8040                                                             invalidating the
8041                                                             cache.
8042
8043                                                         4. buffer_wbinvl1_vol
8044
8045                                                           - Must happen before
8046                                                             any following
8047                                                             global/generic
8048                                                             load/load
8049                                                             atomic/atomicrmw.
8050                                                           - Ensures that
8051                                                             following loads
8052                                                             will not see stale
8053                                                             global data.
8054
8055     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
8056
8057                                                           - Must happen before
8058                                                             following s_waitcnt.
8059                                                           - Performs L2 writeback to
8060                                                             ensure previous
8061                                                             global/generic
8062                                                             store/atomicrmw are
8063                                                             visible at system scope.
8064
8065                                                         2. s_waitcnt lgkmcnt(0) &
8066                                                            vmcnt(0)
8067
8068                                                           - If TgSplit execution mode,
8069                                                             omit lgkmcnt(0).
8070                                                           - If OpenCL, omit
8071                                                             lgkmcnt(0).
8072                                                           - Could be split into
8073                                                             separate s_waitcnt
8074                                                             vmcnt(0) and
8075                                                             s_waitcnt
8076                                                             lgkmcnt(0) to allow
8077                                                             them to be
8078                                                             independently moved
8079                                                             according to the
8080                                                             following rules.
8081                                                           - s_waitcnt vmcnt(0)
8082                                                             must happen after
8083                                                             any preceding
8084                                                             global/generic
8085                                                             load/store/load
8086                                                             atomic/store
8087                                                             atomic/atomicrmw.
8088                                                           - s_waitcnt lgkmcnt(0)
8089                                                             must happen after
8090                                                             any preceding
8091                                                             local/generic
8092                                                             load/store/load
8093                                                             atomic/store
8094                                                             atomic/atomicrmw.
8095                                                           - Must happen before
8096                                                             the following
8097                                                             atomicrmw.
8098                                                           - Ensures that all
8099                                                             memory operations
8100                                                             to global and L2 writeback
8101                                                             have completed before
8102                                                             performing the
8103                                                             atomicrmw that is
8104                                                             being released.
8105
8106                                                         3. flat_atomic
8107                                                         4. s_waitcnt vmcnt(0) &
8108                                                            lgkmcnt(0)
8109
8110                                                           - If TgSplit execution mode,
8111                                                             omit lgkmcnt(0).
8112                                                           - If OpenCL, omit
8113                                                             lgkmcnt(0).
8114                                                           - Must happen before
8115                                                             following buffer_invl2 and
8116                                                             buffer_wbinvl1_vol.
8117                                                           - Ensures the
8118                                                             atomicrmw has
8119                                                             completed before
8120                                                             invalidating the
8121                                                             caches.
8122
8123                                                         5. buffer_invl2;
8124                                                            buffer_wbinvl1_vol
8125
8126                                                           - Must happen before
8127                                                             any following
8128                                                             global/generic
8129                                                             load/load
8130                                                             atomic/atomicrmw.
8131                                                           - Ensures that
8132                                                             following
8133                                                             loads will not see
8134                                                             stale L1 global data,
8135                                                             nor see stale L2 MTYPE
8136                                                             NC global data.
8137                                                             MTYPE RW and CC memory will
8138                                                             never be stale in L2 due to
8139                                                             the memory probes.
8140
8141     fence        acq_rel      - singlethread *none*     *none*
8142                               - wavefront
8143     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
8144
8145                                                           - Use lgkmcnt(0) if not
8146                                                             TgSplit execution mode
8147                                                             and vmcnt(0) if TgSplit
8148                                                             execution mode.
8149                                                           - If OpenCL and
8150                                                             address space is
8151                                                             not generic, omit
8152                                                             lgkmcnt(0).
8153                                                           - If OpenCL and
8154                                                             address space is
8155                                                             local, omit
8156                                                             vmcnt(0).
8157                                                           - However,
8158                                                             since LLVM
8159                                                             currently has no
8160                                                             address space on
8161                                                             the fence, these waits
8162                                                             must conservatively
8163                                                             always be generated
8164                                                             (see comment for
8165                                                             previous fence).
8166                                                           - s_waitcnt vmcnt(0)
8167                                                             must happen after
8168                                                             any preceding
8169                                                             global/generic
8170                                                             load/store/
8171                                                             load atomic/store atomic/
8172                                                             atomicrmw.
8173                                                           - s_waitcnt lgkmcnt(0)
8174                                                             must happen after
8175                                                             any preceding
8176                                                             local/generic
8177                                                             load/load
8178                                                             atomic/store/store
8179                                                             atomic/atomicrmw.
8180                                                           - Must happen before
8181                                                             any following
8182                                                             global/generic
8183                                                             load/load
8184                                                             atomic/store/store
8185                                                             atomic/atomicrmw.
8186                                                           - Ensures that all
8187                                                             memory operations
8188                                                             have
8189                                                             completed before
8190                                                             performing any
8191                                                             following global
8192                                                             memory operations.
8193                                                           - Ensures that the
8194                                                             preceding
8195                                                             local/generic load
8196                                                             atomic/atomicrmw
8197                                                             with an equal or
8198                                                             wider sync scope
8199                                                             and memory ordering
8200                                                             stronger than
8201                                                             unordered (this is
8202                                                             termed the
8203                                                             acquire-fence-paired-atomic)
8204                                                             has completed
8205                                                             before following
8206                                                             global memory
8207                                                             operations. This
8208                                                             satisfies the
8209                                                             requirements of
8210                                                             acquire.
8211                                                           - Ensures that all
8212                                                             previous memory
8213                                                             operations have
8214                                                             completed before a
8215                                                             following
8216                                                             local/generic store
8217                                                             atomic/atomicrmw
8218                                                             with an equal or
8219                                                             wider sync scope
8220                                                             and memory ordering
8221                                                             stronger than
8222                                                             unordered (this is
8223                                                             termed the
8224                                                             release-fence-paired-atomic).
8225                                                             This satisfies the
8226                                                             requirements of
8227                                                             release.
8228                                                           - Must happen before
8229                                                             the following
8230                                                             buffer_wbinvl1_vol.
8231                                                           - Ensures that the
8232                                                             acquire-fence-paired-atomic
8233                                                             has completed
8234                                                             before invalidating
8235                                                             the
8236                                                             cache. Therefore
8237                                                             any following
8238                                                             locations read must
8239                                                             be no older than
8240                                                             the value read by
8241                                                             the
8242                                                             acquire-fence-paired-atomic.
8243
8244                                                         2. buffer_wbinvl1_vol
8245
8246                                                           - If not TgSplit execution
8247                                                             mode, omit.
8248                                                           - Ensures that
8249                                                             following
8250                                                             loads will not see
8251                                                             stale data.
8252
8253     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8254                                                            vmcnt(0)
8255
8256                                                           - If TgSplit execution mode,
8257                                                             omit lgkmcnt(0).
8258                                                           - If OpenCL and
8259                                                             address space is
8260                                                             not generic, omit
8261                                                             lgkmcnt(0).
8262                                                           - However, since LLVM
8263                                                             currently has no
8264                                                             address space on
8265                                                             the fence, this wait
8266                                                             must conservatively
8267                                                             always be generated
8268                                                             (see comment for
8269                                                             previous fence).
8270                                                           - Could be split into
8271                                                             separate s_waitcnt
8272                                                             vmcnt(0) and
8273                                                             s_waitcnt
8274                                                             lgkmcnt(0) to allow
8275                                                             them to be
8276                                                             independently moved
8277                                                             according to the
8278                                                             following rules.
8279                                                           - s_waitcnt vmcnt(0)
8280                                                             must happen after
8281                                                             any preceding
8282                                                             global/generic
8283                                                             load/store/load
8284                                                             atomic/store
8285                                                             atomic/atomicrmw.
8286                                                           - s_waitcnt lgkmcnt(0)
8287                                                             must happen after
8288                                                             any preceding
8289                                                             local/generic
8290                                                             load/store/load
8291                                                             atomic/store
8292                                                             atomic/atomicrmw.
8293                                                           - Must happen before
8294                                                             the following
8295                                                             buffer_wbinvl1_vol.
8296                                                           - Ensures that the
8297                                                             preceding
8298                                                             global/local/generic
8299                                                             load
8300                                                             atomic/atomicrmw
8301                                                             with an equal or
8302                                                             wider sync scope
8303                                                             and memory ordering
8304                                                             stronger than
8305                                                             unordered (this is
8306                                                             termed the
8307                                                             acquire-fence-paired-atomic)
8308                                                             has completed
8309                                                             before invalidating
8310                                                             the cache. This
8311                                                             satisfies the
8312                                                             requirements of
8313                                                             acquire.
8314                                                           - Ensures that all
8315                                                             previous memory
8316                                                             operations have
8317                                                             completed before a
8318                                                             following
8319                                                             global/local/generic
8320                                                             store
8321                                                             atomic/atomicrmw
8322                                                             with an equal or
8323                                                             wider sync scope
8324                                                             and memory ordering
8325                                                             stronger than
8326                                                             unordered (this is
8327                                                             termed the
8328                                                             release-fence-paired-atomic).
8329                                                             This satisfies the
8330                                                             requirements of
8331                                                             release.
8332
8333                                                         2. buffer_wbinvl1_vol
8334
8335                                                           - Must happen before
8336                                                             any following
8337                                                             global/generic
8338                                                             load/load
8339                                                             atomic/store/store
8340                                                             atomic/atomicrmw.
8341                                                           - Ensures that
8342                                                             following loads
8343                                                             will not see stale
8344                                                             global data. This
8345                                                             satisfies the
8346                                                             requirements of
8347                                                             acquire.
8348
8349     fence        acq_rel      - system       *none*     1. buffer_wbl2
8350
8351                                                           - If OpenCL and
8352                                                             address space is
8353                                                             local, omit.
8354                                                           - Must happen before
8355                                                             following s_waitcnt.
8356                                                           - Performs L2 writeback to
8357                                                             ensure previous
8358                                                             global/generic
8359                                                             store/atomicrmw are
8360                                                             visible at system scope.
8361
8362                                                         2. s_waitcnt lgkmcnt(0) &
8363                                                            vmcnt(0)
8364
8365                                                           - If TgSplit execution mode,
8366                                                             omit lgkmcnt(0).
8367                                                           - If OpenCL and
8368                                                             address space is
8369                                                             not generic, omit
8370                                                             lgkmcnt(0).
8371                                                           - However, since LLVM
8372                                                             currently has no
8373                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated
8377                                                             (see comment for
8378                                                             previous fence).
8379                                                           - Could be split into
8380                                                             separate s_waitcnt
8381                                                             vmcnt(0) and
8382                                                             s_waitcnt
8383                                                             lgkmcnt(0) to allow
8384                                                             them to be
8385                                                             independently moved
8386                                                             according to the
8387                                                             following rules.
8388                                                           - s_waitcnt vmcnt(0)
8389                                                             must happen after
8390                                                             any preceding
8391                                                             global/generic
8392                                                             load/store/load
8393                                                             atomic/store
8394                                                             atomic/atomicrmw.
8395                                                           - s_waitcnt lgkmcnt(0)
8396                                                             must happen after
8397                                                             any preceding
8398                                                             local/generic
8399                                                             load/store/load
8400                                                             atomic/store
8401                                                             atomic/atomicrmw.
8402                                                           - Must happen before
8403                                                             the following buffer_invl2 and
8404                                                             buffer_wbinvl1_vol.
8405                                                           - Ensures that the
8406                                                             preceding
8407                                                             global/local/generic
8408                                                             load
8409                                                             atomic/atomicrmw
8410                                                             with an equal or
8411                                                             wider sync scope
8412                                                             and memory ordering
8413                                                             stronger than
8414                                                             unordered (this is
8415                                                             termed the
8416                                                             acquire-fence-paired-atomic)
8417                                                             has completed
8418                                                             before invalidating
8419                                                             the cache. This
8420                                                             satisfies the
8421                                                             requirements of
8422                                                             acquire.
8423                                                           - Ensures that all
8424                                                             previous memory
8425                                                             operations have
8426                                                             completed before a
8427                                                             following
8428                                                             global/local/generic
8429                                                             store
8430                                                             atomic/atomicrmw
8431                                                             with an equal or
8432                                                             wider sync scope
8433                                                             and memory ordering
8434                                                             stronger than
8435                                                             unordered (this is
8436                                                             termed the
8437                                                             release-fence-paired-atomic).
8438                                                             This satisfies the
8439                                                             requirements of
8440                                                             release.
8441
8442                                                         3.  buffer_invl2;
8443                                                             buffer_wbinvl1_vol
8444
8445                                                           - Must happen before
8446                                                             any following
8447                                                             global/generic
8448                                                             load/load
8449                                                             atomic/store/store
8450                                                             atomic/atomicrmw.
8451                                                           - Ensures that
8452                                                             following
8453                                                             loads will not see
8454                                                             stale L1 global data,
8455                                                             nor see stale L2 MTYPE
8456                                                             NC global data.
8457                                                             MTYPE RW and CC memory will
8458                                                             never be stale in L2 due to
8459                                                             the memory probes.
8460
8461     **Sequential Consistent Atomic**
8462     ------------------------------------------------------------------------------------
8463     load atomic  seq_cst      - singlethread - global   *Same as corresponding
8464                               - wavefront    - local    load atomic acquire,
8465                                              - generic  except must generate
8466                                                         all instructions even
8467                                                         for OpenCL.*
8468     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8469                                              - generic
8470                                                           - Use lgkmcnt(0) if not
8471                                                             TgSplit execution mode
8472                                                             and vmcnt(0) if TgSplit
8473                                                             execution mode.
8474                                                           - s_waitcnt lgkmcnt(0) must
8475                                                             happen after
8476                                                             preceding
8477                                                             local/generic load
8478                                                             atomic/store
8479                                                             atomic/atomicrmw
8480                                                             with memory
8481                                                             ordering of seq_cst
8482                                                             and with equal or
8483                                                             wider sync scope.
8484                                                             (Note that seq_cst
8485                                                             fences have their
8486                                                             own s_waitcnt
8487                                                             lgkmcnt(0) and so do
8488                                                             not need to be
8489                                                             considered.)
8490                                                           - s_waitcnt vmcnt(0)
8491                                                             must happen after
8492                                                             preceding
8493                                                             global/generic load
8494                                                             atomic/store
8495                                                             atomic/atomicrmw
8496                                                             with memory
8497                                                             ordering of seq_cst
8498                                                             and with equal or
8499                                                             wider sync scope.
8500                                                             (Note that seq_cst
8501                                                             fences have their
8502                                                             own s_waitcnt
8503                                                             vmcnt(0) and so do
8504                                                             not need to be
8505                                                             considered.)
8506                                                           - Ensures any
8507                                                             preceding
                                                             sequentially
8509                                                             consistent global/local
8510                                                             memory instructions
8511                                                             have completed
8512                                                             before executing
8513                                                             this sequentially
8514                                                             consistent
8515                                                             instruction. This
8516                                                             prevents reordering
8517                                                             a seq_cst store
8518                                                             followed by a
8519                                                             seq_cst load. (Note
8520                                                             that seq_cst is
8521                                                             stronger than
8522                                                             acquire/release as
8523                                                             the reordering of
8524                                                             load acquire
8525                                                             followed by a store
8526                                                             release is
8527                                                             prevented by the
8528                                                             s_waitcnt of
8529                                                             the release, but
8530                                                             there is nothing
8531                                                             preventing a store
8532                                                             release followed by
8533                                                             load acquire from
8534                                                             completing out of
8535                                                             order. The s_waitcnt
8536                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We
8539                                                             choose the load to
8540                                                             make the s_waitcnt be
8541                                                             as late as possible
8542                                                             so that the store
8543                                                             may have already
8544                                                             completed.)
8545
8546                                                         2. *Following
8547                                                            instructions same as
8548                                                            corresponding load
8549                                                            atomic acquire,
8550                                                            except must generate
8551                                                            all instructions even
8552                                                            for OpenCL.*
8553     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8554                                                         local address space cannot
8555                                                         be used.*
8556
8557                                                         *Same as corresponding
8558                                                         load atomic acquire,
8559                                                         except must generate
8560                                                         all instructions even
8561                                                         for OpenCL.*
8562
8563     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8564                               - system       - generic     vmcnt(0)
8565
8566                                                           - If TgSplit execution mode,
8567                                                             omit lgkmcnt(0).
8568                                                           - Could be split into
8569                                                             separate s_waitcnt
8570                                                             vmcnt(0)
8571                                                             and s_waitcnt
8572                                                             lgkmcnt(0) to allow
8573                                                             them to be
8574                                                             independently moved
8575                                                             according to the
8576                                                             following rules.
8577                                                           - s_waitcnt lgkmcnt(0)
8578                                                             must happen after
8579                                                             preceding
                                                             local/generic load
8581                                                             atomic/store
8582                                                             atomic/atomicrmw
8583                                                             with memory
8584                                                             ordering of seq_cst
8585                                                             and with equal or
8586                                                             wider sync scope.
8587                                                             (Note that seq_cst
8588                                                             fences have their
8589                                                             own s_waitcnt
8590                                                             lgkmcnt(0) and so do
8591                                                             not need to be
8592                                                             considered.)
8593                                                           - s_waitcnt vmcnt(0)
8594                                                             must happen after
8595                                                             preceding
8596                                                             global/generic load
8597                                                             atomic/store
8598                                                             atomic/atomicrmw
8599                                                             with memory
8600                                                             ordering of seq_cst
8601                                                             and with equal or
8602                                                             wider sync scope.
8603                                                             (Note that seq_cst
8604                                                             fences have their
8605                                                             own s_waitcnt
8606                                                             vmcnt(0) and so do
8607                                                             not need to be
8608                                                             considered.)
8609                                                           - Ensures any
8610                                                             preceding
                                                             sequentially
8612                                                             consistent global
8613                                                             memory instructions
8614                                                             have completed
8615                                                             before executing
8616                                                             this sequentially
8617                                                             consistent
8618                                                             instruction. This
8619                                                             prevents reordering
8620                                                             a seq_cst store
8621                                                             followed by a
8622                                                             seq_cst load. (Note
8623                                                             that seq_cst is
8624                                                             stronger than
8625                                                             acquire/release as
8626                                                             the reordering of
8627                                                             load acquire
8628                                                             followed by a store
8629                                                             release is
8630                                                             prevented by the
8631                                                             s_waitcnt of
8632                                                             the release, but
8633                                                             there is nothing
8634                                                             preventing a store
8635                                                             release followed by
8636                                                             load acquire from
8637                                                             completing out of
8638                                                             order. The s_waitcnt
8639                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We
8642                                                             choose the load to
8643                                                             make the s_waitcnt be
8644                                                             as late as possible
8645                                                             so that the store
8646                                                             may have already
8647                                                             completed.)
8648
8649                                                         2. *Following
8650                                                            instructions same as
8651                                                            corresponding load
8652                                                            atomic acquire,
8653                                                            except must generate
8654                                                            all instructions even
8655                                                            for OpenCL.*
8656     store atomic seq_cst      - singlethread - global   *Same as corresponding
8657                               - wavefront    - local    store atomic release,
8658                               - workgroup    - generic  except must generate
8659                               - agent                   all instructions even
8660                               - system                  for OpenCL.*
8661     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8662                               - wavefront    - local    atomicrmw acq_rel,
8663                               - workgroup    - generic  except must generate
8664                               - agent                   all instructions even
8665                               - system                  for OpenCL.*
8666     fence        seq_cst      - singlethread *none*     *Same as corresponding
8667                               - wavefront               fence acq_rel,
8668                               - workgroup               except must generate
8669                               - agent                   all instructions even
8670                               - system                  for OpenCL.*
8671     ============ ============ ============== ========== ================================
8672
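As an illustration of the sequentially consistent rows above, the following
LLVM IR is a minimal sketch of operations that would select those code
sequences. The kernel name ``@seq_cst_example`` and the arguments ``%flag`` and
``%data`` are hypothetical and chosen only for this example. The comments
restate the table entries; they are not compiler output.

.. code-block:: llvm

  ; Hypothetical kernel: the names below are illustrative only.
  define amdgpu_kernel void @seq_cst_example(ptr addrspace(1) %flag,
                                             ptr addrspace(1) %data) {
  entry:
    ; store atomic seq_cst, agent scope, global address space:
    ; same as the corresponding store atomic release, except that all
    ; instructions must be generated even for OpenCL.
    store atomic i32 1, ptr addrspace(1) %flag syncscope("agent") seq_cst, align 4

    ; load atomic seq_cst, agent scope, global address space:
    ;   1. s_waitcnt lgkmcnt(0) & vmcnt(0)
    ;      (lgkmcnt(0) is omitted in TgSplit execution mode)
    ;   2. following instructions same as the corresponding load atomic acquire
    %v = load atomic i32, ptr addrspace(1) %data syncscope("agent") seq_cst, align 4

    ; fence seq_cst without a syncscope, which is system scope:
    ; same as the corresponding fence acq_rel.
    fence seq_cst
    ret void
  }

Omitting ``syncscope`` selects system scope, so the final fence corresponds to
the ``system`` entry of the ``fence seq_cst`` row.
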
8673.. _amdgpu-amdhsa-memory-model-gfx940:
8674
8675Memory Model GFX940
8676+++++++++++++++++++
8677
8678For GFX940:
8679
8680* Each agent has multiple shader arrays (SA).
8681* Each SA has multiple compute units (CU).
8682* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in
  which the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated because wavefronts of the same work-group can be in different CUs.
8689* All LDS operations of a CU are performed as wavefront wide operations in a
8690  global order and involve no caching. Completion is reported to a wavefront in
8691  execution order.
8692* The LDS memory has multiple request queues shared by the SIMDs of a
8693  CU. Therefore, the LDS operations performed by different wavefronts of a
8694  work-group can be reordered relative to each other, which can result in
8695  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. An ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8698  vector memory operations between wavefronts of a work-group, but not between
8699  operations performed by the same wavefront.
8700* The vector memory operations are performed as wavefront wide operations and
8701  completion is reported to a wavefront in execution order. The exception is
8702  that ``flat_load/store/atomic`` instructions can report out of vector memory
8703  order if they access LDS memory, and out of LDS operation order if they access
8704  global memory.
8705* The vector memory operations access a single vector L1 cache shared by all
8706  SIMDs of a CU. Therefore:
8707
8708  * No special action is required for coherence between the lanes of a single
8709    wavefront.
8710
8711  * No special action is required for coherence between wavefronts in the same
8712    work-group since they execute on the same CU. The exception is tgsplit
8713    execution mode, in which wavefronts of the same work-group can be in
8714    different CUs, so a ``buffer_inv sc0`` is required to invalidate the L1
8715    cache.
8716
8717  * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
8718    between wavefronts executing in different work-groups as they may be
8719    executing on different CUs.
8720
8721  * Atomic read-modify-write instructions implicitly bypass the L1 cache.
8722    Therefore, they do not use the sc0 bit for coherence and instead use it to
8723    indicate if the instruction returns the original value being updated. They
8724    do use sc1 to indicate system or agent scope coherence.
8725
8726* The scalar memory operations access a scalar L1 cache shared by all wavefronts
8727  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
8728  scalar operations are used in a restricted way so do not impact the memory
8729  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
8730* The vector and scalar memory operations use an L2 cache.
8731
8732  * GFX940 can be configured as a number of smaller agents, each having a
8733    single L2 shared by all CUs on the same agent, or as fewer (possibly one)
8734    larger agents, with groups of CUs on each agent each sharing separate L2
8735    caches.
8736  * The L2 cache has independent channels to service disjoint ranges of virtual
8737    addresses.
8738  * Each CU has a separate request queue per channel for its associated L2.
8739    Therefore, the vector and scalar memory operations performed by wavefronts
8740    executing with different L1 caches and the same L2 cache can be reordered
8741    relative to each other.
8742  * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
8743    vector memory operations of different CUs. It ensures a previous vector
8744    memory operation has completed before executing a subsequent vector memory
8745    or LDS operation and so can be used to meet the requirements of acquire and
8746    release.
8747  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
8748    (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
8749    the PTE C-bit set for memory not local to the L2.
8750
8751    * Any local memory cache lines will be automatically invalidated by writes
8752      from CUs associated with other L2 caches, or writes from the CPU, due to
8753      the cache probe caused by the PTE C-bit.
8754    * XGMI accesses from the CPU to local memory may be cached on the CPU.
8755      Subsequent access from the GPU will automatically invalidate or write back
8756      the CPU cache due to the L2 probe filter.
8757    * To ensure coherence of local memory writes of CUs with different L1 caches
8758      in the same agent, a ``buffer_wbl2`` is required. It does nothing if the
8759      agent is configured to have a single L2, or writes back dirty L2 cache
8760      lines if configured to have multiple L2 caches.
8761    * To ensure coherence of local memory writes of CUs in different agents, a
8762      ``buffer_wbl2 sc1`` is required. It writes back dirty L2 cache lines.
8763    * To ensure coherence of local memory reads of CUs with different L1 caches
8764      in the same agent, a ``buffer_inv sc1`` is required. It does nothing if
8765      the agent is configured to have a single L2, or invalidates non-local L2
8766      cache lines if configured to have multiple L2 caches.
8767    * To ensure coherence of local memory reads of CUs in different agents, a
8768      ``buffer_inv sc0 sc1`` is required. It invalidates non-local L2 cache
8769      lines if configured to have multiple L2 caches.
8770
8771  * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
8772    UC (uncached) which bypasses the L2.
8773
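The four L2 coherence cases for local memory listed above can be summarized as
a small lookup. The following Python sketch is illustrative only: the names
``L2_COHERENCE`` and ``l2_coherence_instruction``, and the case labels
``same_agent`` and ``other_agent``, are hypothetical and are not part of LLVM
or of any AMD tool. It only restates the instructions named in the list.

.. code-block:: python

   # Hypothetical summary of the L2 coherence cases described in the text.
   # Keys are (access, peer):
   #   access: "write" or "read" of memory local to the L2
   #   peer:   "same_agent"  -> CUs with different L1 caches in the same agent
   #           "other_agent" -> CUs in different agents
   L2_COHERENCE = {
       ("write", "same_agent"): "buffer_wbl2",
       ("write", "other_agent"): "buffer_wbl2 sc1",
       ("read", "same_agent"): "buffer_inv sc1",
       ("read", "other_agent"): "buffer_inv sc0 sc1",
   }


   def l2_coherence_instruction(access, peer):
       """Return the instruction the text names for the given case."""
       try:
           return L2_COHERENCE[(access, peer)]
       except KeyError:
           raise ValueError(f"unknown case: {access!r}, {peer!r}") from None

For example, ``l2_coherence_instruction("read", "other_agent")`` returns
``buffer_inv sc0 sc1``. Whether the instruction has any effect still depends on
how many L2 caches the agent is configured with, which the sketch does not
model.
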
8774Scalar memory operations are only used to access memory that is proven to not
8775change during the execution of the kernel dispatch. This includes constant
8776address space and global address space for program scope ``const`` variables.
8777Therefore, the kernel machine code does not have to maintain the scalar cache to
8778ensure it is coherent with the vector caches. The scalar and vector caches are
8779invalidated between kernel dispatches by CP since constant address space data
8780may change between kernel dispatch executions. See
8781:ref:`amdgpu-amdhsa-memory-spaces`.
8782
8783The one exception is if scalar writes are used to spill SGPR registers. In this
8784case the AMDGPU backend ensures the memory location used to spill is never
8785accessed by vector memory operations at the same time. If scalar writes are used
8786then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8787return since the locations may be used for vector memory instructions by a
8788future wavefront that uses the same scratch area, or a function call that
8789creates a frame at the same address, respectively. There is no need for a
8790``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8791
8792For kernarg backing memory:
8793
8794* CP invalidates the L1 cache at the start of each kernel dispatch.
8795* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
8796  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
8797  cache. This also causes it to be treated as non-volatile, so it is not
8798  invalidated by ``*_vol``.
8799* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8800  so the L2 cache will be coherent with the CPU and other agents.
8801
8802Scratch backing memory (which is used for the private address space) is accessed
8803with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
8804only accessed by a single thread, and is always write-before-read, there is
8805never a need to invalidate these entries from the L1 cache. Hence all cache
8806invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
8807
8808The code sequences used to implement the memory model for GFX940 are defined
8809in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
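
As a reading aid for the monotonic atomic rows of that table, the following
Python sketch restates how the ``sc0`` and ``sc1`` modifiers vary with the sync
scope for global and generic accesses. The function name
``monotonic_modifiers`` is hypothetical. The sketch is not the backend's
implementation, and it covers only the monotonic rows.

.. code-block:: python

   SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


   def monotonic_modifiers(instr, scope):
       """Return the sc0/sc1 modifiers for a monotonic atomic operation.

       instr: "load atomic", "store atomic" or "atomicrmw"
       scope: one of SCOPES
       Applies to the global and generic address spaces only.
       """
       if scope not in SCOPES:
           raise ValueError(f"unknown sync scope: {scope!r}")
       if instr in ("load atomic", "store atomic"):
           sc0 = scope in ("workgroup", "system")
           sc1 = scope in ("agent", "system")
       elif instr == "atomicrmw":
           # For read-modify-write atomics sc0 selects whether the original
           # value is returned, so it is not modeled as a coherence bit here.
           sc0 = False
           sc1 = scope == "system"
       else:
           raise ValueError(f"unknown instruction: {instr!r}")
       bits = []
       if sc0:
           bits.append("sc0=1")
       if sc1:
           bits.append("sc1=1")
       return " ".join(bits)

For example, ``monotonic_modifiers("load atomic", "system")`` returns
``sc0=1 sc1=1``, while ``monotonic_modifiers("atomicrmw", "agent")`` returns an
empty string.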
8810
8811  .. table:: AMDHSA Memory Model Code Sequences GFX940
8812     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
8813
8814     ============ ============ ============== ========== ================================
8815     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
8816                  Ordering     Sync Scope     Address    GFX940
8817                                              Space
8818     ============ ============ ============== ========== ================================
8819     **Non-Atomic**
8820     ------------------------------------------------------------------------------------
8821     load         *none*       *none*         - global   - !volatile & !nontemporal
8822                                              - generic
8823                                              - private    1. buffer/global/flat_load
8824                                              - constant
8825                                                         - !volatile & nontemporal
8826
8827                                                           1. buffer/global/flat_load
8828                                                              nt=1
8829
8830                                                         - volatile
8831
8832                                                           1. buffer/global/flat_load
8833                                                              sc0=1 sc1=1
8834                                                           2. s_waitcnt vmcnt(0)
8835
8836                                                            - Must happen before
8837                                                              any following volatile
8838                                                              global/generic
8839                                                              load/store.
8840                                                            - Ensures that
8841                                                              volatile
8842                                                              operations to
8843                                                              different
8844                                                              addresses will not
8845                                                              be reordered by
8846                                                              hardware.
8847
8848     load         *none*       *none*         - local    1. ds_load
8849     store        *none*       *none*         - global   - !volatile & !nontemporal
8850                                              - generic
8851                                              - private    1. buffer/global/flat_store
8852                                              - constant
8853                                                         - !volatile & nontemporal
8854
8855                                                           1. buffer/global/flat_store
8856                                                              nt=1
8857
8858                                                         - volatile
8859
8860                                                           1. buffer/global/flat_store
8861                                                              sc0=1 sc1=1
8862                                                           2. s_waitcnt vmcnt(0)
8863
8864                                                            - Must happen before
8865                                                              any following volatile
8866                                                              global/generic
8867                                                              load/store.
8868                                                            - Ensures that
8869                                                              volatile
8870                                                              operations to
8871                                                              different
8872                                                              addresses will not
8873                                                              be reordered by
8874                                                              hardware.
8875
8876     store        *none*       *none*         - local    1. ds_store
8877     **Unordered Atomic**
8878     ------------------------------------------------------------------------------------
8879     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
8880     store atomic unordered    *any*          *any*      *Same as non-atomic*.
8881     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
8882     **Monotonic Atomic**
8883     ------------------------------------------------------------------------------------
8884     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
8885                               - wavefront    - generic
8886     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
8887                                              - generic     sc0=1
8888     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
8889                               - wavefront               local address space cannot
8890                               - workgroup               be used.*
8891
8892                                                         1. ds_load
8893     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
8894                                              - generic     sc1=1
8895     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
8896                                              - generic     sc0=1 sc1=1
8897     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
8898                               - wavefront    - generic
8899     store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
8900                                              - generic     sc0=1
8901     store atomic monotonic    - agent        - global   1. buffer/global/flat_store
8902                                              - generic     sc1=1
8903     store atomic monotonic    - system       - global   1. buffer/global/flat_store
8904                                              - generic     sc0=1 sc1=1
8905     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
8906                               - wavefront               local address space cannot
8907                               - workgroup               be used.*
8908
8909                                                         1. ds_store
8910     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
8911                               - wavefront    - generic
8912                               - workgroup
8913                               - agent
8914     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
8915                                              - generic     sc1=1
8916     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
8917                               - wavefront               local address space cannot
8918                               - workgroup               be used.*
8919
8920                                                         1. ds_atomic
8921     **Acquire Atomic**
8922     ------------------------------------------------------------------------------------
8923     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
8924                               - wavefront    - local
8925                                              - generic
8926     load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
8927                                                         2. s_waitcnt vmcnt(0)
8928
8929                                                           - If not TgSplit execution
8930                                                             mode, omit.
8931                                                           - Must happen before the
8932                                                             following buffer_inv.
8933
8934                                                         3. buffer_inv sc0=1
8935
8936                                                           - If not TgSplit execution
8937                                                             mode, omit.
8938                                                           - Must happen before
8939                                                             any following
8940                                                             global/generic
8941                                                             load/load
8942                                                             atomic/store/store
8943                                                             atomic/atomicrmw.
8944                                                           - Ensures that
8945                                                             following
8946                                                             loads will not see
8947                                                             stale data.
8948
8949     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
8950                                                         local address space cannot
8951                                                         be used.*
8952
8953                                                         1. ds_load
8954                                                         2. s_waitcnt lgkmcnt(0)
8955
8956                                                           - If OpenCL, omit.
8957                                                           - Must happen before
8958                                                             any following
8959                                                             global/generic
8960                                                             load/load
8961                                                             atomic/store/store
8962                                                             atomic/atomicrmw.
8963                                                           - Ensures any
8964                                                             following global
8965                                                             data read is no
8966                                                             older than the local load
8967                                                             atomic value being
8968                                                             acquired.
8969
8970     load atomic  acquire      - workgroup    - generic  1. flat_load sc0=1
8971                                                         2. s_waitcnt lgkm/vmcnt(0)
8972
8973                                                           - Use lgkmcnt(0) if not
8974                                                             TgSplit execution mode
8975                                                             and vmcnt(0) if TgSplit
8976                                                             execution mode.
8977                                                           - If OpenCL, omit lgkmcnt(0).
8978                                                           - Must happen before
8979                                                             the following
8980                                                             buffer_inv and any
8981                                                             following global/generic
8982                                                             load/load
8983                                                             atomic/store/store
8984                                                             atomic/atomicrmw.
8985                                                           - Ensures any
8986                                                             following global
8987                                                             data read is no
8988                                                             older than a local load
8989                                                             atomic value being
8990                                                             acquired.
8991
8992                                                         3. buffer_inv sc0=1
8993
8994                                                           - If not TgSplit execution
8995                                                             mode, omit.
8996                                                           - Ensures that
8997                                                             following
8998                                                             loads will not see
8999                                                             stale data.
9000
9001     load atomic  acquire      - agent        - global   1. buffer/global_load
9002                                                            sc1=1
9003                                                         2. s_waitcnt vmcnt(0)
9004
9005                                                           - Must happen before
9006                                                             following
9007                                                             buffer_inv.
9008                                                           - Ensures the load
9009                                                             has completed
9010                                                             before invalidating
9011                                                             the cache.
9012
9013                                                         3. buffer_inv sc1=1
9014
9015                                                           - Must happen before
9016                                                             any following
9017                                                             global/generic
9018                                                             load/load
9019                                                             atomic/atomicrmw.
9020                                                           - Ensures that
9021                                                             following
9022                                                             loads will not see
9023                                                             stale global data.
9024
9025     load atomic  acquire      - system       - global   1. buffer/global/flat_load
9026                                                            sc0=1 sc1=1
9027                                                         2. s_waitcnt vmcnt(0)
9028
9029                                                           - Must happen before
9030                                                             following
9031                                                             buffer_inv.
9032                                                           - Ensures the load
9033                                                             has completed
9034                                                             before invalidating
9035                                                             the cache.
9036
9037                                                         3. buffer_inv sc0=1 sc1=1
9038
9039                                                           - Must happen before
9040                                                             any following
9041                                                             global/generic
9042                                                             load/load
9043                                                             atomic/atomicrmw.
9044                                                           - Ensures that
9045                                                             following
9046                                                             loads will not see
9047                                                             stale MTYPE NC global data.
9048                                                             MTYPE RW and CC memory will
9049                                                             never be stale due to the
9050                                                             memory probes.
9051
9052     load atomic  acquire      - agent        - generic  1. flat_load sc1=1
9053                                                         2. s_waitcnt vmcnt(0) &
9054                                                            lgkmcnt(0)
9055
9056                                                           - If TgSplit execution mode,
9057                                                             omit lgkmcnt(0).
9058                                                           - If OpenCL, omit
9059                                                             lgkmcnt(0).
9060                                                           - Must happen before
9061                                                             following
9062                                                             buffer_inv.
9063                                                           - Ensures the flat_load
9064                                                             has completed
9065                                                             before invalidating
9066                                                             the cache.
9067
9068                                                         3. buffer_inv sc1=1
9069
9070                                                           - Must happen before
9071                                                             any following
9072                                                             global/generic
9073                                                             load/load
9074                                                             atomic/atomicrmw.
9075                                                           - Ensures that
9076                                                             following loads
9077                                                             will not see stale
9078                                                             global data.
9079
9080     load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
9081                                                         2. s_waitcnt vmcnt(0) &
9082                                                            lgkmcnt(0)
9083
9084                                                           - If TgSplit execution mode,
9085                                                             omit lgkmcnt(0).
9086                                                           - If OpenCL, omit
9087                                                             lgkmcnt(0).
9088                                                           - Must happen before
9089                                                             the following
9090                                                             buffer_inv.
9091                                                           - Ensures the flat_load
9092                                                             has completed
9093                                                             before invalidating
9094                                                             the caches.
9095
9096                                                         3. buffer_inv sc0=1 sc1=1
9097
9098                                                           - Must happen before
9099                                                             any following
9100                                                             global/generic
9101                                                             load/load
9102                                                             atomic/atomicrmw.
9103                                                           - Ensures that
9104                                                             following
9105                                                             loads will not see
9106                                                             stale MTYPE NC global data.
9107                                                             MTYPE RW and CC memory will
9108                                                             never be stale due to the
9109                                                             memory probes.
9110
9111     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
9112                               - wavefront    - generic
9113     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
9114                               - wavefront               local address space cannot
9115                                                         be used.*
9116
9117                                                         1. ds_atomic
9118     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
9119                                                         2. s_waitcnt vmcnt(0)
9120
9121                                                           - If not TgSplit execution
9122                                                             mode, omit.
9123                                                           - Must happen before the
9124                                                             following buffer_inv.
9125                                                           - Ensures the atomicrmw
9126                                                             has completed
9127                                                             before invalidating
9128                                                             the cache.
9129
9130                                                         3. buffer_inv sc0=1
9131
9132                                                           - If not TgSplit execution
9133                                                             mode, omit.
9134                                                           - Must happen before
9135                                                             any following
9136                                                             global/generic
9137                                                             load/load
9138                                                             atomic/atomicrmw.
9139                                                           - Ensures that
9140                                                             following loads
9141                                                             will not see stale
9142                                                             global data.
9143
9144     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
9145                                                         local address space cannot
9146                                                         be used.*
9147
9148                                                         1. ds_atomic
9149                                                         2. s_waitcnt lgkmcnt(0)
9150
9151                                                           - If OpenCL, omit.
9152                                                           - Must happen before
9153                                                             any following
9154                                                             global/generic
9155                                                             load/load
9156                                                             atomic/store/store
9157                                                             atomic/atomicrmw.
9158                                                           - Ensures any
9159                                                             following global
9160                                                             data read is no
9161                                                             older than the local
9162                                                             atomicrmw value
9163                                                             being acquired.
9164
9165     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
9166                                                         2. s_waitcnt lgkm/vmcnt(0)
9167
9168                                                           - Use lgkmcnt(0) if not
9169                                                             TgSplit execution mode
9170                                                             and vmcnt(0) if TgSplit
9171                                                             execution mode.
9172                                                           - If OpenCL, omit lgkmcnt(0).
9173                                                           - Must happen before
9174                                                             the following
9175                                                             buffer_inv and
9176                                                             any following
9177                                                             global/generic
9178                                                             load/load
9179                                                             atomic/store/store
9180                                                             atomic/atomicrmw.
9181                                                           - Ensures any
9182                                                             following global
9183                                                             data read is no
9184                                                             older than a local
9185                                                             atomicrmw value
9186                                                             being acquired.
9187
9188                                                         3. buffer_inv sc0=1
9189
9190                                                           - If not TgSplit execution
9191                                                             mode, omit.
9192                                                           - Ensures that
9193                                                             following
9194                                                             loads will not see
9195                                                             stale data.
9196
9197     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
9198                                                         2. s_waitcnt vmcnt(0)
9199
9200                                                           - Must happen before
9201                                                             following
9202                                                             buffer_inv.
9203                                                           - Ensures the
9204                                                             atomicrmw has
9205                                                             completed before
9206                                                             invalidating the
9207                                                             cache.
9208
9209                                                         3. buffer_inv sc1=1
9210
9211                                                           - Must happen before
9212                                                             any following
9213                                                             global/generic
9214                                                             load/load
9215                                                             atomic/atomicrmw.
9216                                                           - Ensures that
9217                                                             following loads
9218                                                             will not see stale
9219                                                             global data.
9220
9221     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
9222                                                            sc1=1
9223                                                         2. s_waitcnt vmcnt(0)
9224
9225                                                           - Must happen before
9226                                                             following
9227                                                             buffer_inv.
9228                                                           - Ensures the
9229                                                             atomicrmw has
9230                                                             completed before
9231                                                             invalidating the
9232                                                             caches.
9233
9234                                                         3. buffer_inv sc0=1 sc1=1
9235
9236                                                           - Must happen before
9237                                                             any following
9238                                                             global/generic
9239                                                             load/load
9240                                                             atomic/atomicrmw.
9241                                                           - Ensures that
9242                                                             following
9243                                                             loads will not see
9244                                                             stale MTYPE NC global data.
9245                                                             MTYPE RW and CC memory will
9246                                                             never be stale due to the
9247                                                             memory probes.
9248
9249     atomicrmw    acquire      - agent        - generic  1. flat_atomic
9250                                                         2. s_waitcnt vmcnt(0) &
9251                                                            lgkmcnt(0)
9252
9253                                                           - If TgSplit execution mode,
9254                                                             omit lgkmcnt(0).
9255                                                           - If OpenCL, omit
9256                                                             lgkmcnt(0).
9257                                                           - Must happen before
9258                                                             following
9259                                                             buffer_inv.
9260                                                           - Ensures the
9261                                                             atomicrmw has
9262                                                             completed before
9263                                                             invalidating the
9264                                                             cache.
9265
9266                                                         3. buffer_inv sc1=1
9267
9268                                                           - Must happen before
9269                                                             any following
9270                                                             global/generic
9271                                                             load/load
9272                                                             atomic/atomicrmw.
9273                                                           - Ensures that
9274                                                             following loads
9275                                                             will not see stale
9276                                                             global data.
9277
9278     atomicrmw    acquire      - system       - generic  1. flat_atomic sc1=1
9279                                                         2. s_waitcnt vmcnt(0) &
9280                                                            lgkmcnt(0)
9281
9282                                                           - If TgSplit execution mode,
9283                                                             omit lgkmcnt(0).
9284                                                           - If OpenCL, omit
9285                                                             lgkmcnt(0).
9286                                                           - Must happen before
9287                                                             following
9288                                                             buffer_inv.
9289                                                           - Ensures the
9290                                                             atomicrmw has
9291                                                             completed before
9292                                                             invalidating the
9293                                                             caches.
9294
9295                                                         3. buffer_inv sc0=1 sc1=1
9296
9297                                                           - Must happen before
9298                                                             any following
9299                                                             global/generic
9300                                                             load/load
9301                                                             atomic/atomicrmw.
9302                                                           - Ensures that
9303                                                             following
9304                                                             loads will not see
9305                                                             stale MTYPE NC global data.
9306                                                             MTYPE RW and CC memory will
9307                                                             never be stale due to the
9308                                                             memory probes.
9309
9310     fence        acquire      - singlethread *none*     *none*
9311                               - wavefront
9312     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9313
9314                                                           - Use lgkmcnt(0) if not
9315                                                             TgSplit execution mode
9316                                                             and vmcnt(0) if TgSplit
9317                                                             execution mode.
9318                                                           - If OpenCL and
9319                                                             address space is
9320                                                             not generic, omit
9321                                                             lgkmcnt(0).
9322                                                           - If OpenCL and
9323                                                             address space is
9324                                                             local, omit
9325                                                             vmcnt(0).
9326                                                           - However, since LLVM
9327                                                             currently has no
9328                                                             address space on
9329                                                             the fence, the
9330                                                             s_waitcnt must
9331                                                             conservatively always
9332                                                             be generated. If the
9333                                                             fence had an address
9334                                                             space, it would be
9335                                                             set to the address
9336                                                             space of the OpenCL
9337                                                             fence flag, or to
9338                                                             generic if both
9339                                                             local and global
9340                                                             flags are specified.
9341                                                           - s_waitcnt vmcnt(0)
9342                                                             must happen after
9343                                                             any preceding
9344                                                             global/generic load
9345                                                             atomic/
9346                                                             atomicrmw
9347                                                             with an equal or
9348                                                             wider sync scope
9349                                                             and memory ordering
9350                                                             stronger than
9351                                                             unordered (this is
9352                                                             termed the
9353                                                             fence-paired-atomic).
9354                                                           - s_waitcnt lgkmcnt(0)
9355                                                             must happen after
9356                                                             any preceding
9357                                                             local/generic load
9358                                                             atomic/atomicrmw
9359                                                             with an equal or
9360                                                             wider sync scope
9361                                                             and memory ordering
9362                                                             stronger than
9363                                                             unordered (this is
9364                                                             termed the
9365                                                             fence-paired-atomic).
9366                                                           - Must happen before
9367                                                             the following
9368                                                             buffer_inv and
9369                                                             any following
9370                                                             global/generic
9371                                                             load/load
9372                                                             atomic/store/store
9373                                                             atomic/atomicrmw.
9374                                                           - Ensures any
9375                                                             following global
9376                                                             data read is no
9377                                                             older than the
9378                                                             value read by the
9379                                                             fence-paired-atomic.
9380
9381                                                         2. buffer_inv sc0=1
9382
9383                                                           - If not TgSplit execution
9384                                                             mode, omit.
9385                                                           - Ensures that
9386                                                             following
9387                                                             loads will not see
9388                                                             stale data.
9389
9390     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9391                                                            vmcnt(0)
9392
9393                                                           - If TgSplit execution mode,
9394                                                             omit lgkmcnt(0).
9395                                                           - If OpenCL and
9396                                                             address space is
9397                                                             not generic, omit
9398                                                             lgkmcnt(0).
9399                                                           - However, since LLVM
9400                                                             currently has no
9401                                                             address space on
9402                                                             the fence, it must
9403                                                             conservatively
9404                                                             always be generated
9405                                                             (see comment for
9406                                                             previous fence).
9407                                                           - Could be split into
9408                                                             separate s_waitcnt
9409                                                             vmcnt(0) and
9410                                                             s_waitcnt
9411                                                             lgkmcnt(0) to allow
9412                                                             them to be
9413                                                             independently moved
9414                                                             according to the
9415                                                             following rules.
9416                                                           - s_waitcnt vmcnt(0)
9417                                                             must happen after
9418                                                             any preceding
9419                                                             global/generic load
9420                                                             atomic/atomicrmw
9421                                                             with an equal or
9422                                                             wider sync scope
9423                                                             and memory ordering
9424                                                             stronger than
9425                                                             unordered (this is
9426                                                             termed the
9427                                                             fence-paired-atomic).
9428                                                           - s_waitcnt lgkmcnt(0)
9429                                                             must happen after
9430                                                             any preceding
9431                                                             local/generic load
9432                                                             atomic/atomicrmw
9433                                                             with an equal or
9434                                                             wider sync scope
9435                                                             and memory ordering
9436                                                             stronger than
9437                                                             unordered (this is
9438                                                             termed the
9439                                                             fence-paired-atomic).
9440                                                           - Must happen before
9441                                                             the following
9442                                                             buffer_inv.
9443                                                           - Ensures that the
9444                                                             fence-paired-atomic
9445                                                             has completed
9446                                                             before invalidating
9447                                                             the
9448                                                             cache. Therefore
9449                                                             any following
9450                                                             locations read must
9451                                                             be no older than
9452                                                             the value read by
9453                                                             the
9454                                                             fence-paired-atomic.
9455
9456                                                         2. buffer_inv sc1=1
9457
9458                                                           - Must happen before any
9459                                                             following global/generic
9460                                                             load/load
9461                                                             atomic/store/store
9462                                                             atomic/atomicrmw.
9463                                                           - Ensures that
9464                                                             following loads
9465                                                             will not see stale
9466                                                             global data.
9467
9468     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
9469                                                            vmcnt(0)
9470
9471                                                           - If TgSplit execution mode,
9472                                                             omit lgkmcnt(0).
9473                                                           - If OpenCL and
9474                                                             address space is
9475                                                             not generic, omit
9476                                                             lgkmcnt(0).
9477                                                           - However, since LLVM
9478                                                             currently has no
9479                                                             address space on
9480                                                             the fence, it must
9481                                                             conservatively
9482                                                             always be generated
9483                                                             (see comment for
9484                                                             previous fence).
9485                                                           - Could be split into
9486                                                             separate s_waitcnt
9487                                                             vmcnt(0) and
9488                                                             s_waitcnt
9489                                                             lgkmcnt(0) to allow
9490                                                             them to be
9491                                                             independently moved
9492                                                             according to the
9493                                                             following rules.
9494                                                           - s_waitcnt vmcnt(0)
9495                                                             must happen after
9496                                                             any preceding
9497                                                             global/generic load
9498                                                             atomic/atomicrmw
9499                                                             with an equal or
9500                                                             wider sync scope
9501                                                             and memory ordering
9502                                                             stronger than
9503                                                             unordered (this is
9504                                                             termed the
9505                                                             fence-paired-atomic).
9506                                                           - s_waitcnt lgkmcnt(0)
9507                                                             must happen after
9508                                                             any preceding
9509                                                             local/generic load
9510                                                             atomic/atomicrmw
9511                                                             with an equal or
9512                                                             wider sync scope
9513                                                             and memory ordering
9514                                                             stronger than
9515                                                             unordered (this is
9516                                                             termed the
9517                                                             fence-paired-atomic).
9518                                                           - Must happen before
9519                                                             the following
9520                                                             buffer_inv.
9521                                                           - Ensures that the
9522                                                             fence-paired-atomic
9523                                                             has completed
9524                                                             before invalidating
9525                                                             the
9526                                                             caches. Therefore
9527                                                             any following
9528                                                             locations read must
9529                                                             be no older than
9530                                                             the value read by
9531                                                             the
9532                                                             fence-paired-atomic.
9533
9534                                                         2. buffer_inv sc0=1 sc1=1
9535
9536                                                           - Must happen before any
9537                                                             following global/generic
9538                                                             load/load
9539                                                             atomic/store/store
9540                                                             atomic/atomicrmw.
9541                                                           - Ensures that
9542                                                             following loads
9543                                                             will not see stale
9544                                                             global data.
9545
9546     **Release Atomic**
9547     ------------------------------------------------------------------------------------
9548     store atomic release      - singlethread - global   1. buffer/global/flat_store
9549                               - wavefront    - generic
9550     store atomic release      - singlethread - local    *If TgSplit execution mode,
9551                               - wavefront               local address space cannot
9552                                                         be used.*
9553
9554                                                         1. ds_store
9555     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9556                                              - generic
9557                                                           - Use lgkmcnt(0) if not
9558                                                             TgSplit execution mode
9559                                                             and vmcnt(0) if TgSplit
9560                                                             execution mode.
9561                                                           - If OpenCL, omit lgkmcnt(0).
9562                                                           - s_waitcnt vmcnt(0)
9563                                                             must happen after
9564                                                             any preceding
9565                                                             global/generic load/store/
9566                                                             load atomic/store atomic/
9567                                                             atomicrmw.
9568                                                           - s_waitcnt lgkmcnt(0)
9569                                                             must happen after
9570                                                             any preceding
9571                                                             local/generic
9572                                                             load/store/load
9573                                                             atomic/store
9574                                                             atomic/atomicrmw.
9575                                                           - Must happen before
9576                                                             the following
9577                                                             store.
9578                                                           - Ensures that all
9579                                                             memory operations
9580                                                             have
9581                                                             completed before
9582                                                             performing the
9583                                                             store that is being
9584                                                             released.
9585
9586                                                         2. buffer/global/flat_store sc0=1
9587     store atomic release      - workgroup    - local    *If TgSplit execution mode,
9588                                                         local address space cannot
9589                                                         be used.*
9590
9591                                                         1. ds_store
9592     store atomic release      - agent        - global   1. buffer_wbl2 sc1=1
9593                                              - generic
9594                                                           - Must happen before
9595                                                             following s_waitcnt.
9596                                                           - Performs L2 writeback to
9597                                                             ensure previous
9598                                                             global/generic
9599                                                             store/atomicrmw are
9600                                                             visible at agent scope.
9601
9602                                                         2. s_waitcnt lgkmcnt(0) &
9603                                                            vmcnt(0)
9604
9605                                                           - If TgSplit execution mode,
9606                                                             omit lgkmcnt(0).
9607                                                           - If OpenCL and
9608                                                             address space is
9609                                                             not generic, omit
9610                                                             lgkmcnt(0).
9611                                                           - Could be split into
9612                                                             separate s_waitcnt
9613                                                             vmcnt(0) and
9614                                                             s_waitcnt
9615                                                             lgkmcnt(0) to allow
9616                                                             them to be
9617                                                             independently moved
9618                                                             according to the
9619                                                             following rules.
9620                                                           - s_waitcnt vmcnt(0)
9621                                                             must happen after
9622                                                             any preceding
9623                                                             global/generic
9624                                                             load/store/load
9625                                                             atomic/store
9626                                                             atomic/atomicrmw.
9627                                                           - s_waitcnt lgkmcnt(0)
9628                                                             must happen after
9629                                                             any preceding
9630                                                             local/generic
9631                                                             load/store/load
9632                                                             atomic/store
9633                                                             atomic/atomicrmw.
9634                                                           - Must happen before
9635                                                             the following
9636                                                             store.
9637                                                           - Ensures that all
9638                                                             memory operations
9639                                                             have
9640                                                             completed before
9641                                                             performing the
9642                                                             store that is being
9643                                                             released.
9644
9645                                                         3. buffer/global/flat_store sc1=1
9646     store atomic release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9647                                              - generic
9648                                                           - Must happen before
9649                                                             following s_waitcnt.
9650                                                           - Performs L2 writeback to
9651                                                             ensure previous
9652                                                             global/generic
9653                                                             store/atomicrmw are
9654                                                             visible at system scope.
9655
9656                                                         2. s_waitcnt lgkmcnt(0) &
9657                                                            vmcnt(0)
9658
9659                                                           - If TgSplit execution mode,
9660                                                             omit lgkmcnt(0).
9661                                                           - If OpenCL and
9662                                                             address space is
9663                                                             not generic, omit
9664                                                             lgkmcnt(0).
9665                                                           - Could be split into
9666                                                             separate s_waitcnt
9667                                                             vmcnt(0) and
9668                                                             s_waitcnt
9669                                                             lgkmcnt(0) to allow
9670                                                             them to be
9671                                                             independently moved
9672                                                             according to the
9673                                                             following rules.
9674                                                           - s_waitcnt vmcnt(0)
9675                                                             must happen after any
9676                                                             preceding
9677                                                             global/generic
9678                                                             load/store/load
9679                                                             atomic/store
9680                                                             atomic/atomicrmw.
9681                                                           - s_waitcnt lgkmcnt(0)
9682                                                             must happen after any
9683                                                             preceding
9684                                                             local/generic
9685                                                             load/store/load
9686                                                             atomic/store
9687                                                             atomic/atomicrmw.
9688                                                           - Must happen before
9689                                                             the following
9690                                                             store.
9691                                                           - Ensures that all
9692                                                             memory operations
9693                                                             to memory and the L2
9694                                                             writeback have
9695                                                             completed before
9696                                                             performing the
9697                                                             store that is being
9698                                                             released.
9699
9700                                                         3. buffer/global/flat_store
9701                                                            sc0=1 sc1=1
9702     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
9703                               - wavefront    - generic
9704     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
9705                               - wavefront               local address space cannot
9706                                                         be used.*
9707
9708                                                         1. ds_atomic
9709     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9710                                              - generic
9711                                                           - Use lgkmcnt(0) if not
9712                                                             TgSplit execution mode
9713                                                             and vmcnt(0) if TgSplit
9714                                                             execution mode.
9715                                                           - If OpenCL, omit
9716                                                             lgkmcnt(0).
9717                                                           - s_waitcnt vmcnt(0)
9718                                                             must happen after
9719                                                             any preceding
9720                                                             global/generic load/store/
9721                                                             load atomic/store atomic/
9722                                                             atomicrmw.
9723                                                           - s_waitcnt lgkmcnt(0)
9724                                                             must happen after
9725                                                             any preceding
9726                                                             local/generic
9727                                                             load/store/load
9728                                                             atomic/store
9729                                                             atomic/atomicrmw.
9730                                                           - Must happen before
9731                                                             the following
9732                                                             atomicrmw.
9733                                                           - Ensures that all
9734                                                             memory operations
9735                                                             have
9736                                                             completed before
9737                                                             performing the
9738                                                             atomicrmw that is
9739                                                             being released.
9740
9741                                                         2. buffer/global/flat_atomic sc0=1
9742     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
9743                                                         local address space cannot
9744                                                         be used.*
9745
9746                                                         1. ds_atomic
9747     atomicrmw    release      - agent        - global   1. buffer_wbl2 sc1=1
9748                                              - generic
9749                                                           - Must happen before
9750                                                             following s_waitcnt.
9751                                                           - Performs L2 writeback to
9752                                                             ensure previous
9753                                                             global/generic
9754                                                             store/atomicrmw are
9755                                                             visible at agent scope.
9756
9757                                                         2. s_waitcnt lgkmcnt(0) &
9758                                                            vmcnt(0)
9759
9760                                                           - If TgSplit execution mode,
9761                                                             omit lgkmcnt(0).
9762                                                           - If OpenCL, omit
9763                                                             lgkmcnt(0).
9764                                                           - Could be split into
9765                                                             separate s_waitcnt
9766                                                             vmcnt(0) and
9767                                                             s_waitcnt
9768                                                             lgkmcnt(0) to allow
9769                                                             them to be
9770                                                             independently moved
9771                                                             according to the
9772                                                             following rules.
9773                                                           - s_waitcnt vmcnt(0)
9774                                                             must happen after
9775                                                             any preceding
9776                                                             global/generic
9777                                                             load/store/load
9778                                                             atomic/store
9779                                                             atomic/atomicrmw.
9780                                                           - s_waitcnt lgkmcnt(0)
9781                                                             must happen after
9782                                                             any preceding
9783                                                             local/generic
9784                                                             load/store/load
9785                                                             atomic/store
9786                                                             atomic/atomicrmw.
9787                                                           - Must happen before
9788                                                             the following
9789                                                             atomicrmw.
9790                                                           - Ensures that all
9791                                                             memory operations
9792                                                             to global and local
9793                                                             have completed
9794                                                             before performing
9795                                                             the atomicrmw that
9796                                                             is being released.
9797
9798                                                         3. buffer/global/flat_atomic sc1=1
9799     atomicrmw    release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9800                                              - generic
9801                                                           - Must happen before
9802                                                             following s_waitcnt.
9803                                                           - Performs L2 writeback to
9804                                                             ensure previous
9805                                                             global/generic
9806                                                             store/atomicrmw are
9807                                                             visible at system scope.
9808
9809                                                         2. s_waitcnt lgkmcnt(0) &
9810                                                            vmcnt(0)
9811
9812                                                           - If TgSplit execution mode,
9813                                                             omit lgkmcnt(0).
9814                                                           - If OpenCL, omit
9815                                                             lgkmcnt(0).
9816                                                           - Could be split into
9817                                                             separate s_waitcnt
9818                                                             vmcnt(0) and
9819                                                             s_waitcnt
9820                                                             lgkmcnt(0) to allow
9821                                                             them to be
9822                                                             independently moved
9823                                                             according to the
9824                                                             following rules.
9825                                                           - s_waitcnt vmcnt(0)
9826                                                             must happen after
9827                                                             any preceding
9828                                                             global/generic
9829                                                             load/store/load
9830                                                             atomic/store
9831                                                             atomic/atomicrmw.
9832                                                           - s_waitcnt lgkmcnt(0)
9833                                                             must happen after
9834                                                             any preceding
9835                                                             local/generic
9836                                                             load/store/load
9837                                                             atomic/store
9838                                                             atomic/atomicrmw.
9839                                                           - Must happen before
9840                                                             the following
9841                                                             atomicrmw.
9842                                                           - Ensures that all
9843                                                             memory operations
9844                                                             to memory and the L2
9845                                                             writeback have
9846                                                             completed before
9847                                                             performing the
9848                                                             atomicrmw that is
9849                                                             being released.
9850
9851                                                         3. buffer/global/flat_atomic
9852                                                            sc0=1 sc1=1
9853     fence        release      - singlethread *none*     *none*
9854                               - wavefront
9855     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9856
9857                                                           - Use lgkmcnt(0) if not
9858                                                             TgSplit execution mode
9859                                                             and vmcnt(0) if TgSplit
9860                                                             execution mode.
9861                                                           - If OpenCL and
9862                                                             address space is
9863                                                             not generic, omit
9864                                                             lgkmcnt(0).
9865                                                           - If OpenCL and
9866                                                             address space is
9867                                                             local, omit
9868                                                             vmcnt(0).
9869                                                           - However, since LLVM
9870                                                             currently has no
9871                                                             address space on
9872                                                             the fence, this must
9873                                                             conservatively
9874                                                             always be generated.
9875                                                             If fence had an
9876                                                             address space, it
9877                                                             would be set to the
9878                                                             address space of the
9879                                                             OpenCL fence flag,
9880                                                             or to generic if
9881                                                             both local and
9882                                                             global flags are
9883                                                             specified.
9884                                                           - s_waitcnt vmcnt(0)
9885                                                             must happen after
9886                                                             any preceding
9887                                                             global/generic
9888                                                             load/store/
9889                                                             load atomic/store atomic/
9890                                                             atomicrmw.
9891                                                           - s_waitcnt lgkmcnt(0)
9892                                                             must happen after
9893                                                             any preceding
9894                                                             local/generic
9895                                                             load/load
9896                                                             atomic/store/store
9897                                                             atomic/atomicrmw.
9898                                                           - Must happen before
9899                                                             any following store
9900                                                             atomic/atomicrmw
9901                                                             with an equal or
9902                                                             wider sync scope
9903                                                             and memory ordering
9904                                                             stronger than
9905                                                             unordered (this is
9906                                                             termed the
9907                                                             fence-paired-atomic).
9908                                                           - Ensures that all
9909                                                             memory operations
9910                                                             have
9911                                                             completed before
9912                                                             performing the
9913                                                             following
9914                                                             fence-paired-atomic.
9915
9916     fence        release      - agent        *none*     1. buffer_wbl2 sc1=1
9917
9918                                                           - If OpenCL and
9919                                                             address space is
9920                                                             local, omit.
9921                                                           - Must happen before
9922                                                             following s_waitcnt.
9923                                                           - Performs L2 writeback to
9924                                                             ensure previous
9925                                                             global/generic
9926                                                             store/atomicrmw are
9927                                                             visible at agent scope.
9928
9929                                                         2. s_waitcnt lgkmcnt(0) &
9930                                                            vmcnt(0)
9931
9932                                                           - If TgSplit execution mode,
9933                                                             omit lgkmcnt(0).
9934                                                           - If OpenCL and
9935                                                             address space is
9936                                                             not generic, omit
9937                                                             lgkmcnt(0).
9938                                                           - If OpenCL and
9939                                                             address space is
9940                                                             local, omit
9941                                                             vmcnt(0).
9942                                                           - However, since LLVM
9943                                                             currently has no
9944                                                             address space on
9945                                                             the fence, this must
9946                                                             conservatively
9947                                                             always be generated.
9948                                                             If fence had an
9949                                                             address space, it
9950                                                             would be set to the
9951                                                             address space of the
9952                                                             OpenCL fence flag,
9953                                                             or to generic if
9954                                                             both local and
9955                                                             global flags are
9956                                                             specified.
9957                                                           - Could be split into
9958                                                             separate s_waitcnt
9959                                                             vmcnt(0) and
9960                                                             s_waitcnt
9961                                                             lgkmcnt(0) to allow
9962                                                             them to be
9963                                                             independently moved
9964                                                             according to the
9965                                                             following rules.
9966                                                           - s_waitcnt vmcnt(0)
9967                                                             must happen after
9968                                                             any preceding
9969                                                             global/generic
9970                                                             load/store/load
9971                                                             atomic/store
9972                                                             atomic/atomicrmw.
9973                                                           - s_waitcnt lgkmcnt(0)
9974                                                             must happen after
9975                                                             any preceding
9976                                                             local/generic
9977                                                             load/store/load
9978                                                             atomic/store
9979                                                             atomic/atomicrmw.
9980                                                           - Must happen before
9981                                                             any following store
9982                                                             atomic/atomicrmw
9983                                                             with an equal or
9984                                                             wider sync scope
9985                                                             and memory ordering
9986                                                             stronger than
9987                                                             unordered (this is
9988                                                             termed the
9989                                                             fence-paired-atomic).
9990                                                           - Ensures that all
9991                                                             memory operations
9992                                                             have
9993                                                             completed before
9994                                                             performing the
9995                                                             following
9996                                                             fence-paired-atomic.
9997
9998     fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
9999
10000                                                           - Must happen before
10001                                                             following s_waitcnt.
10002                                                           - Performs L2 writeback to
10003                                                             ensure previous
10004                                                             global/generic
10005                                                             store/atomicrmw are
10006                                                             visible at system scope.
10007
10008                                                         2. s_waitcnt lgkmcnt(0) &
10009                                                            vmcnt(0)
10010
10011                                                           - If TgSplit execution mode,
10012                                                             omit lgkmcnt(0).
10013                                                           - If OpenCL and
10014                                                             address space is
10015                                                             not generic, omit
10016                                                             lgkmcnt(0).
10017                                                           - If OpenCL and
10018                                                             address space is
10019                                                             local, omit
10020                                                             vmcnt(0).
10021                                                           - However, since LLVM
10022                                                             currently has no
10023                                                             address space on
10024                                                             the fence, these
10025                                                             waits must
10026                                                             conservatively
10027                                                             always be generated.
10028                                                             If the fence had an
10029                                                             address space, it
10030                                                             would be set to the
10031                                                             address space of the
10032                                                             OpenCL fence flag, or
10033                                                             to generic if both
10034                                                             local and global
10035                                                             flags are specified.
10036                                                           - Could be split into
10037                                                             separate s_waitcnt
10038                                                             vmcnt(0) and
10039                                                             s_waitcnt
10040                                                             lgkmcnt(0) to allow
10041                                                             them to be
10042                                                             independently moved
10043                                                             according to the
10044                                                             following rules.
10045                                                           - s_waitcnt vmcnt(0)
10046                                                             must happen after
10047                                                             any preceding
10048                                                             global/generic
10049                                                             load/store/load
10050                                                             atomic/store
10051                                                             atomic/atomicrmw.
10052                                                           - s_waitcnt lgkmcnt(0)
10053                                                             must happen after
10054                                                             any preceding
10055                                                             local/generic
10056                                                             load/store/load
10057                                                             atomic/store
10058                                                             atomic/atomicrmw.
10059                                                           - Must happen before
10060                                                             any following store
10061                                                             atomic/atomicrmw
10062                                                             with an equal or
10063                                                             wider sync scope
10064                                                             and memory ordering
10065                                                             stronger than
10066                                                             unordered (this is
10067                                                             termed the
10068                                                             fence-paired-atomic).
10069                                                           - Ensures that all
10070                                                             memory operations
10071                                                             have
10072                                                             completed before
10073                                                             performing the
10074                                                             following
10075                                                             fence-paired-atomic.
10076
10077     **Acquire-Release Atomic**
10078     ------------------------------------------------------------------------------------
10079     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
10080                               - wavefront    - generic
10081     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
10082                               - wavefront               local address space cannot
10083                                                         be used.*
10084
10085                                                         1. ds_atomic
10086     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10087
10088                                                           - Use lgkmcnt(0) if not
10089                                                             TgSplit execution mode
10090                                                             and vmcnt(0) if TgSplit
10091                                                             execution mode.
10092                                                           - If OpenCL, omit
10093                                                             lgkmcnt(0).
10094                                                           - Must happen after
10095                                                             any preceding
10096                                                             local/generic
10097                                                             load/store/load
10098                                                             atomic/store
10099                                                             atomic/atomicrmw.
10100                                                           - s_waitcnt vmcnt(0)
10101                                                             must happen after
10102                                                             any preceding
10103                                                             global/generic load/store/
10104                                                             load atomic/store atomic/
10105                                                             atomicrmw.
10106                                                           - s_waitcnt lgkmcnt(0)
10107                                                             must happen after
10108                                                             any preceding
10109                                                             local/generic
10110                                                             load/store/load
10111                                                             atomic/store
10112                                                             atomic/atomicrmw.
10113                                                           - Must happen before
10114                                                             the following
10115                                                             atomicrmw.
10116                                                           - Ensures that all
10117                                                             memory operations
10118                                                             have
10119                                                             completed before
10120                                                             performing the
10121                                                             atomicrmw that is
10122                                                             being released.
10123
10124                                                         2. buffer/global_atomic
10125                                                         3. s_waitcnt vmcnt(0)
10126
10127                                                           - If not TgSplit execution
10128                                                             mode, omit.
10129                                                           - Must happen before
10130                                                             the following
10131                                                             buffer_inv.
10132                                                           - Ensures any
10133                                                             following global
10134                                                             data read is no
10135                                                             older than the
10136                                                             atomicrmw value
10137                                                             being acquired.
10138
10139                                                         4. buffer_inv sc0=1
10140
10141                                                           - If not TgSplit execution
10142                                                             mode, omit.
10143                                                           - Ensures that
10144                                                             following
10145                                                             loads will not see
10146                                                             stale data.
10147
10148     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
10149                                                         local address space cannot
10150                                                         be used.*
10151
10152                                                         1. ds_atomic
10153                                                         2. s_waitcnt lgkmcnt(0)
10154
10155                                                           - If OpenCL, omit.
10156                                                           - Must happen before
10157                                                             any following
10158                                                             global/generic
10159                                                             load/load
10160                                                             atomic/store/store
10161                                                             atomic/atomicrmw.
10162                                                           - Ensures any
10163                                                             following global
10164                                                             data read is no
10165                                                             older than the local
10166                                                             atomicrmw value being
10167                                                             acquired.
10168
10169     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
10170
10171                                                           - Use lgkmcnt(0) if not
10172                                                             TgSplit execution mode
10173                                                             and vmcnt(0) if TgSplit
10174                                                             execution mode.
10175                                                           - If OpenCL, omit
10176                                                             lgkmcnt(0).
10177                                                           - s_waitcnt vmcnt(0)
10178                                                             must happen after
10179                                                             any preceding
10180                                                             global/generic load/store/
10181                                                             load atomic/store atomic/
10182                                                             atomicrmw.
10183                                                           - s_waitcnt lgkmcnt(0)
10184                                                             must happen after
10185                                                             any preceding
10186                                                             local/generic
10187                                                             load/store/load
10188                                                             atomic/store
10189                                                             atomic/atomicrmw.
10190                                                           - Must happen before
10191                                                             the following
10192                                                             atomicrmw.
10193                                                           - Ensures that all
10194                                                             memory operations
10195                                                             have
10196                                                             completed before
10197                                                             performing the
10198                                                             atomicrmw that is
10199                                                             being released.
10200
10201                                                         2. flat_atomic
10202                                                         3. s_waitcnt lgkmcnt(0) &
10203                                                            vmcnt(0)
10204
10205                                                           - If not TgSplit execution
10206                                                             mode, omit vmcnt(0).
10207                                                           - If OpenCL, omit
10208                                                             lgkmcnt(0).
10209                                                           - Must happen before
10210                                                             the following
10211                                                             buffer_inv and
10212                                                             any following
10213                                                             global/generic
10214                                                             load/load
10215                                                             atomic/store/store
10216                                                             atomic/atomicrmw.
10217                                                           - Ensures any
10218                                                             following global
10219                                                             data read is no
10220                                                             older than a local
10221                                                             atomicrmw value being
10222                                                             acquired.
10223
10224                                                         4. buffer_inv sc0=1
10225
10226                                                           - If not TgSplit execution
10227                                                             mode, omit.
10228                                                           - Ensures that
10229                                                             following
10230                                                             loads will not see
10231                                                             stale data.
10232
10233     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1
10234
10235                                                           - Must happen before
10236                                                             following s_waitcnt.
10237                                                           - Performs L2 writeback to
10238                                                             ensure previous
10239                                                             global/generic
10240                                                             store/atomicrmw are
10241                                                             visible at agent scope.
10242
10243                                                         2. s_waitcnt lgkmcnt(0) &
10244                                                            vmcnt(0)
10245
10246                                                           - If TgSplit execution mode,
10247                                                             omit lgkmcnt(0).
10248                                                           - If OpenCL, omit
10249                                                             lgkmcnt(0).
10250                                                           - Could be split into
10251                                                             separate s_waitcnt
10252                                                             vmcnt(0) and
10253                                                             s_waitcnt
10254                                                             lgkmcnt(0) to allow
10255                                                             them to be
10256                                                             independently moved
10257                                                             according to the
10258                                                             following rules.
10259                                                           - s_waitcnt vmcnt(0)
10260                                                             must happen after
10261                                                             any preceding
10262                                                             global/generic
10263                                                             load/store/load
10264                                                             atomic/store
10265                                                             atomic/atomicrmw.
10266                                                           - s_waitcnt lgkmcnt(0)
10267                                                             must happen after
10268                                                             any preceding
10269                                                             local/generic
10270                                                             load/store/load
10271                                                             atomic/store
10272                                                             atomic/atomicrmw.
10273                                                           - Must happen before
10274                                                             the following
10275                                                             atomicrmw.
10276                                                           - Ensures that all
10277                                                             memory operations
10278                                                             to global have
10279                                                             completed before
10280                                                             performing the
10281                                                             atomicrmw that is
10282                                                             being released.
10283
10284                                                         3. buffer/global_atomic
10285                                                         4. s_waitcnt vmcnt(0)
10286
10287                                                           - Must happen before
10288                                                             following
10289                                                             buffer_inv.
10290                                                           - Ensures the
10291                                                             atomicrmw has
10292                                                             completed before
10293                                                             invalidating the
10294                                                             cache.
10295
10296                                                         5. buffer_inv sc1=1
10297
10298                                                           - Must happen before
10299                                                             any following
10300                                                             global/generic
10301                                                             load/load
10302                                                             atomic/atomicrmw.
10303                                                           - Ensures that
10304                                                             following loads
10305                                                             will not see stale
10306                                                             global data.
10307
10308     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
10309
10310                                                           - Must happen before
10311                                                             following s_waitcnt.
10312                                                           - Performs L2 writeback to
10313                                                             ensure previous
10314                                                             global/generic
10315                                                             store/atomicrmw are
10316                                                             visible at system scope.
10317
10318                                                         2. s_waitcnt lgkmcnt(0) &
10319                                                            vmcnt(0)
10320
10321                                                           - If TgSplit execution mode,
10322                                                             omit lgkmcnt(0).
10323                                                           - If OpenCL, omit
10324                                                             lgkmcnt(0).
10325                                                           - Could be split into
10326                                                             separate s_waitcnt
10327                                                             vmcnt(0) and
10328                                                             s_waitcnt
10329                                                             lgkmcnt(0) to allow
10330                                                             them to be
10331                                                             independently moved
10332                                                             according to the
10333                                                             following rules.
10334                                                           - s_waitcnt vmcnt(0)
10335                                                             must happen after
10336                                                             any preceding
10337                                                             global/generic
10338                                                             load/store/load
10339                                                             atomic/store
10340                                                             atomic/atomicrmw.
10341                                                           - s_waitcnt lgkmcnt(0)
10342                                                             must happen after
10343                                                             any preceding
10344                                                             local/generic
10345                                                             load/store/load
10346                                                             atomic/store
10347                                                             atomic/atomicrmw.
10348                                                           - Must happen before
10349                                                             the following
10350                                                             atomicrmw.
10351                                                           - Ensures that all
10352                                                             memory operations
10353                                                             to global and L2 writeback
10354                                                             have completed before
10355                                                             performing the
10356                                                             atomicrmw that is
10357                                                             being released.
10358
10359                                                         3. buffer/global_atomic
10360                                                            sc1=1
10361                                                         4. s_waitcnt vmcnt(0)
10362
10363                                                           - Must happen before
10364                                                             following
10365                                                             buffer_inv.
10366                                                           - Ensures the
10367                                                             atomicrmw has
10368                                                             completed before
10369                                                             invalidating the
10370                                                             caches.
10371
10372                                                         5. buffer_inv sc0=1 sc1=1
10373
10374                                                           - Must happen before
10375                                                             any following
10376                                                             global/generic
10377                                                             load/load
10378                                                             atomic/atomicrmw.
10379                                                           - Ensures that
10380                                                             following loads
10381                                                             will not see stale
10382                                                             MTYPE NC global data.
10383                                                             MTYPE RW and CC memory will
10384                                                             never be stale due to the
10385                                                             memory probes.
10386
10387     atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1
10388
10389                                                           - Must happen before
10390                                                             following s_waitcnt.
10391                                                           - Performs L2 writeback to
10392                                                             ensure previous
10393                                                             global/generic
10394                                                             store/atomicrmw are
10395                                                             visible at agent scope.
10396
10397                                                         2. s_waitcnt lgkmcnt(0) &
10398                                                            vmcnt(0)
10399
10400                                                           - If TgSplit execution mode,
10401                                                             omit lgkmcnt(0).
10402                                                           - If OpenCL, omit
10403                                                             lgkmcnt(0).
10404                                                           - Could be split into
10405                                                             separate s_waitcnt
10406                                                             vmcnt(0) and
10407                                                             s_waitcnt
10408                                                             lgkmcnt(0) to allow
10409                                                             them to be
10410                                                             independently moved
10411                                                             according to the
10412                                                             following rules.
10413                                                           - s_waitcnt vmcnt(0)
10414                                                             must happen after
10415                                                             any preceding
10416                                                             global/generic
10417                                                             load/store/load
10418                                                             atomic/store
10419                                                             atomic/atomicrmw.
10420                                                           - s_waitcnt lgkmcnt(0)
10421                                                             must happen after
10422                                                             any preceding
10423                                                             local/generic
10424                                                             load/store/load
10425                                                             atomic/store
10426                                                             atomic/atomicrmw.
10427                                                           - Must happen before
10428                                                             the following
10429                                                             atomicrmw.
10430                                                           - Ensures that all
10431                                                             memory operations
10432                                                             to global have
10433                                                             completed before
10434                                                             performing the
10435                                                             atomicrmw that is
10436                                                             being released.
10437
10438                                                         3. flat_atomic
10439                                                         4. s_waitcnt vmcnt(0) &
10440                                                            lgkmcnt(0)
10441
10442                                                           - If TgSplit execution mode,
10443                                                             omit lgkmcnt(0).
10444                                                           - If OpenCL, omit
10445                                                             lgkmcnt(0).
10446                                                           - Must happen before
10447                                                             following
10448                                                             buffer_inv.
10449                                                           - Ensures the
10450                                                             atomicrmw has
10451                                                             completed before
10452                                                             invalidating the
10453                                                             cache.
10454
10455                                                         5. buffer_inv sc1=1
10456
10457                                                           - Must happen before
10458                                                             any following
10459                                                             global/generic
10460                                                             load/load
10461                                                             atomic/atomicrmw.
10462                                                           - Ensures that
10463                                                             following loads
10464                                                             will not see stale
10465                                                             global data.
10466
10467     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1
10468
10469                                                           - Must happen before
10470                                                             following s_waitcnt.
10471                                                           - Performs L2 writeback to
10472                                                             ensure previous
10473                                                             global/generic
10474                                                             store/atomicrmw are
10475                                                             visible at system scope.
10476
10477                                                         2. s_waitcnt lgkmcnt(0) &
10478                                                            vmcnt(0)
10479
10480                                                           - If TgSplit execution mode,
10481                                                             omit lgkmcnt(0).
10482                                                           - If OpenCL, omit
10483                                                             lgkmcnt(0).
10484                                                           - Could be split into
10485                                                             separate s_waitcnt
10486                                                             vmcnt(0) and
10487                                                             s_waitcnt
10488                                                             lgkmcnt(0) to allow
10489                                                             them to be
10490                                                             independently moved
10491                                                             according to the
10492                                                             following rules.
10493                                                           - s_waitcnt vmcnt(0)
10494                                                             must happen after
10495                                                             any preceding
10496                                                             global/generic
10497                                                             load/store/load
10498                                                             atomic/store
10499                                                             atomic/atomicrmw.
10500                                                           - s_waitcnt lgkmcnt(0)
10501                                                             must happen after
10502                                                             any preceding
10503                                                             local/generic
10504                                                             load/store/load
10505                                                             atomic/store
10506                                                             atomic/atomicrmw.
10507                                                           - Must happen before
10508                                                             the following
10509                                                             atomicrmw.
10510                                                           - Ensures that all
10511                                                             memory operations
10512                                                             to global and L2 writeback
10513                                                             have completed before
10514                                                             performing the
10515                                                             atomicrmw that is
10516                                                             being released.
10517
10518                                                         3. flat_atomic sc1=1
10519                                                         4. s_waitcnt vmcnt(0) &
10520                                                            lgkmcnt(0)
10521
10522                                                           - If TgSplit execution mode,
10523                                                             omit lgkmcnt(0).
10524                                                           - If OpenCL, omit
10525                                                             lgkmcnt(0).
10526                                                           - Must happen before
10527                                                             following
10528                                                             buffer_inv.
10529                                                           - Ensures the
10530                                                             atomicrmw has
10531                                                             completed before
10532                                                             invalidating the
10533                                                             caches.
10534
10535                                                         5. buffer_inv sc0=1 sc1=1
10536
10537                                                           - Must happen before
10538                                                             any following
10539                                                             global/generic
10540                                                             load/load
10541                                                             atomic/atomicrmw.
10542                                                           - Ensures that
10543                                                             following loads
10544                                                             will not see stale
10545                                                             MTYPE NC global data.
10546                                                             MTYPE RW and CC memory will
10547                                                             never be stale due to the
10548                                                             memory probes.
10549
10550     fence        acq_rel      - singlethread *none*     *none*
10551                               - wavefront
10552     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
10553
10554                                                           - Use lgkmcnt(0) if not
10555                                                             TgSplit execution mode
10556                                                             and vmcnt(0) if TgSplit
10557                                                             execution mode.
10558                                                           - If OpenCL and
10559                                                             address space is
10560                                                             not generic, omit
10561                                                             lgkmcnt(0).
10562                                                           - If OpenCL and
10563                                                             address space is
10564                                                             local, omit
10565                                                             vmcnt(0).
10566                                                           - However,
10567                                                             since LLVM
10568                                                             currently has no
10569                                                             address space on
10570                                                             the fence, the wait
10571                                                             must conservatively
10572                                                             always be generated
10573                                                             (see comment for
10574                                                             previous fence).
10575                                                           - s_waitcnt vmcnt(0)
10576                                                             must happen after
10577                                                             any preceding
10578                                                             global/generic
10579                                                             load/store/
10580                                                             load atomic/store atomic/
10581                                                             atomicrmw.
10582                                                           - s_waitcnt lgkmcnt(0)
10583                                                             must happen after
10584                                                             any preceding
10585                                                             local/generic
10586                                                             load/load
10587                                                             atomic/store/store
10588                                                             atomic/atomicrmw.
10589                                                           - Must happen before
10590                                                             any following
10591                                                             global/generic
10592                                                             load/load
10593                                                             atomic/store/store
10594                                                             atomic/atomicrmw.
10595                                                           - Ensures that all
10596                                                             memory operations
10597                                                             have
10598                                                             completed before
10599                                                             performing any
10600                                                             following global
10601                                                             memory operations.
10602                                                           - Ensures that the
10603                                                             preceding
10604                                                             local/generic load
10605                                                             atomic/atomicrmw
10606                                                             with an equal or
10607                                                             wider sync scope
10608                                                             and memory ordering
10609                                                             stronger than
10610                                                             unordered (this is
10611                                                             termed the
10612                                                             acquire-fence-paired-atomic)
10613                                                             has completed
10614                                                             before following
10615                                                             global memory
10616                                                             operations. This
10617                                                             satisfies the
10618                                                             requirements of
10619                                                             acquire.
10620                                                           - Ensures that all
10621                                                             previous memory
10622                                                             operations have
10623                                                             completed before a
10624                                                             following
10625                                                             local/generic store
10626                                                             atomic/atomicrmw
10627                                                             with an equal or
10628                                                             wider sync scope
10629                                                             and memory ordering
10630                                                             stronger than
10631                                                             unordered (this is
10632                                                             termed the
10633                                                             release-fence-paired-atomic).
10634                                                             This satisfies the
10635                                                             requirements of
10636                                                             release.
10637                                                           - Must happen before
10638                                                             the following
10639                                                             buffer_inv.
10640                                                           - Ensures that the
10641                                                             acquire-fence-paired-atomic
10642                                                             has completed
10643                                                             before invalidating
10644                                                             the
10645                                                             cache. Therefore
10646                                                             any following
10647                                                             locations read must
10648                                                             be no older than
10649                                                             the value read by
10650                                                             the
10651                                                             acquire-fence-paired-atomic.
10652
10653                                                         2. buffer_inv sc0=1
10654
10655                                                           - If not TgSplit execution
10656                                                             mode, omit.
10657                                                           - Ensures that
10658                                                             following
10659                                                             loads will not see
10660                                                             stale data.
10661
10662     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1
10663
10664                                                           - If OpenCL and
10665                                                             address space is
10666                                                             local, omit.
10667                                                           - Must happen before
10668                                                             following s_waitcnt.
10669                                                           - Performs L2 writeback to
10670                                                             ensure previous
10671                                                             global/generic
10672                                                             store/atomicrmw are
10673                                                             visible at agent scope.
10674
10675                                                         2. s_waitcnt lgkmcnt(0) &
10676                                                            vmcnt(0)
10677
10678                                                           - If TgSplit execution mode,
10679                                                             omit lgkmcnt(0).
10680                                                           - If OpenCL and
10681                                                             address space is
10682                                                             not generic, omit
10683                                                             lgkmcnt(0).
10684                                                           - However, since LLVM
10685                                                             currently has no
10686                                                             address space on
10687                                                             the fence, the wait
10688                                                             must conservatively
10689                                                             always be generated
10690                                                             (see comment for
10691                                                             previous fence).
10692                                                           - Could be split into
10693                                                             separate s_waitcnt
10694                                                             vmcnt(0) and
10695                                                             s_waitcnt
10696                                                             lgkmcnt(0) to allow
10697                                                             them to be
10698                                                             independently moved
10699                                                             according to the
10700                                                             following rules.
10701                                                           - s_waitcnt vmcnt(0)
10702                                                             must happen after
10703                                                             any preceding
10704                                                             global/generic
10705                                                             load/store/load
10706                                                             atomic/store
10707                                                             atomic/atomicrmw.
10708                                                           - s_waitcnt lgkmcnt(0)
10709                                                             must happen after
10710                                                             any preceding
10711                                                             local/generic
10712                                                             load/store/load
10713                                                             atomic/store
10714                                                             atomic/atomicrmw.
10715                                                           - Must happen before
10716                                                             the following
10717                                                             buffer_inv.
10718                                                           - Ensures that the
10719                                                             preceding
10720                                                             global/local/generic
10721                                                             load
10722                                                             atomic/atomicrmw
10723                                                             with an equal or
10724                                                             wider sync scope
10725                                                             and memory ordering
10726                                                             stronger than
10727                                                             unordered (this is
10728                                                             termed the
10729                                                             acquire-fence-paired-atomic)
10730                                                             has completed
10731                                                             before invalidating
10732                                                             the cache. This
10733                                                             satisfies the
10734                                                             requirements of
10735                                                             acquire.
10736                                                           - Ensures that all
10737                                                             previous memory
10738                                                             operations have
10739                                                             completed before a
10740                                                             following
10741                                                             global/local/generic
10742                                                             store
10743                                                             atomic/atomicrmw
10744                                                             with an equal or
10745                                                             wider sync scope
10746                                                             and memory ordering
10747                                                             stronger than
10748                                                             unordered (this is
10749                                                             termed the
10750                                                             release-fence-paired-atomic).
10751                                                             This satisfies the
10752                                                             requirements of
10753                                                             release.
10754
10755                                                         3. buffer_inv sc1=1
10756
10757                                                           - Must happen before
10758                                                             any following
10759                                                             global/generic
10760                                                             load/load
10761                                                             atomic/store/store
10762                                                             atomic/atomicrmw.
10763                                                           - Ensures that
10764                                                             following loads
10765                                                             will not see stale
10766                                                             global data. This
10767                                                             satisfies the
10768                                                             requirements of
10769                                                             acquire.
10770
10771     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10772
10773                                                           - If OpenCL and
10774                                                             address space is
10775                                                             local, omit.
10776                                                           - Must happen before
10777                                                             following s_waitcnt.
10778                                                           - Performs L2 writeback to
10779                                                             ensure previous
10780                                                             global/generic
10781                                                             store/atomicrmw are
10782                                                             visible at system scope.
10783
10784                                                         2. s_waitcnt lgkmcnt(0) &
10785                                                            vmcnt(0)
10786
10787                                                           - If TgSplit execution mode,
10788                                                             omit lgkmcnt(0).
10789                                                           - If OpenCL and
10790                                                             address space is
10791                                                             not generic, omit
10792                                                             lgkmcnt(0).
10793                                                           - However, since LLVM
10794                                                             currently has no
10795                                                             address space on
10796                                                             the fence, the wait
10797                                                             must conservatively
10798                                                             always be generated
10799                                                             (see comment for
10800                                                             previous fence).
10801                                                           - Could be split into
10802                                                             separate s_waitcnt
10803                                                             vmcnt(0) and
10804                                                             s_waitcnt
10805                                                             lgkmcnt(0) to allow
10806                                                             them to be
10807                                                             independently moved
10808                                                             according to the
10809                                                             following rules.
10810                                                           - s_waitcnt vmcnt(0)
10811                                                             must happen after
10812                                                             any preceding
10813                                                             global/generic
10814                                                             load/store/load
10815                                                             atomic/store
10816                                                             atomic/atomicrmw.
10817                                                           - s_waitcnt lgkmcnt(0)
10818                                                             must happen after
10819                                                             any preceding
10820                                                             local/generic
10821                                                             load/store/load
10822                                                             atomic/store
10823                                                             atomic/atomicrmw.
10824                                                           - Must happen before
10825                                                             the following
10826                                                             buffer_inv.
10827                                                           - Ensures that the
10828                                                             preceding
10829                                                             global/local/generic
10830                                                             load
10831                                                             atomic/atomicrmw
10832                                                             with an equal or
10833                                                             wider sync scope
10834                                                             and memory ordering
10835                                                             stronger than
10836                                                             unordered (this is
10837                                                             termed the
10838                                                             acquire-fence-paired-atomic)
10839                                                             has completed
10840                                                             before invalidating
10841                                                             the cache. This
10842                                                             satisfies the
10843                                                             requirements of
10844                                                             acquire.
10845                                                           - Ensures that all
10846                                                             previous memory
10847                                                             operations have
10848                                                             completed before a
10849                                                             following
10850                                                             global/local/generic
10851                                                             store
10852                                                             atomic/atomicrmw
10853                                                             with an equal or
10854                                                             wider sync scope
10855                                                             and memory ordering
10856                                                             stronger than
10857                                                             unordered (this is
10858                                                             termed the
10859                                                             release-fence-paired-atomic).
10860                                                             This satisfies the
10861                                                             requirements of
10862                                                             release.
10863
10864                                                         3. buffer_inv sc0=1 sc1=1
10865
10866                                                           - Must happen before
10867                                                             any following
10868                                                             global/generic
10869                                                             load/load
10870                                                             atomic/store/store
10871                                                             atomic/atomicrmw.
10872                                                           - Ensures that
10873                                                             following loads
10874                                                             will not see stale
10875                                                             MTYPE NC global data.
10876                                                             MTYPE RW and CC memory will
10877                                                             never be stale due to the
10878                                                             memory probes.
10879
10880     **Sequential Consistent Atomic**
10881     ------------------------------------------------------------------------------------
10882     load atomic  seq_cst      - singlethread - global   *Same as corresponding
10883                               - wavefront    - local    load atomic acquire,
10884                                              - generic  except must generate
10885                                                         all instructions even
10886                                                         for OpenCL.*
10887     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10888                                              - generic
10889                                                           - Use lgkmcnt(0) if not
10890                                                             TgSplit execution mode
10891                                                             and vmcnt(0) if TgSplit
10892                                                             execution mode.
10893                                                           - s_waitcnt lgkmcnt(0) must
10894                                                             happen after
10895                                                             preceding
10896                                                             local/generic load
10897                                                             atomic/store
10898                                                             atomic/atomicrmw
10899                                                             with memory
10900                                                             ordering of seq_cst
10901                                                             and with equal or
10902                                                             wider sync scope.
10903                                                             (Note that seq_cst
10904                                                             fences have their
10905                                                             own s_waitcnt
10906                                                             lgkmcnt(0) and so do
10907                                                             not need to be
10908                                                             considered.)
10909                                                           - s_waitcnt vmcnt(0)
10910                                                             must happen after
10911                                                             preceding
10912                                                             global/generic load
10913                                                             atomic/store
10914                                                             atomic/atomicrmw
10915                                                             with memory
10916                                                             ordering of seq_cst
10917                                                             and with equal or
10918                                                             wider sync scope.
10919                                                             (Note that seq_cst
10920                                                             fences have their
10921                                                             own s_waitcnt
10922                                                             vmcnt(0) and so do
10923                                                             not need to be
10924                                                             considered.)
10925                                                           - Ensures any
10926                                                             preceding
10927                                                             sequential
10928                                                             consistent global/local
10929                                                             memory instructions
10930                                                             have completed
10931                                                             before executing
10932                                                             this sequentially
10933                                                             consistent
10934                                                             instruction. This
10935                                                             prevents reordering
10936                                                             a seq_cst store
10937                                                             followed by a
10938                                                             seq_cst load. (Note
10939                                                             that seq_cst is
10940                                                             stronger than
10941                                                             acquire/release as
10942                                                             the reordering of
10943                                                             load acquire
10944                                                             followed by a store
10945                                                             release is
10946                                                             prevented by the
10947                                                             s_waitcnt of
10948                                                             the release, but
10949                                                             there is nothing
10950                                                             preventing a store
10951                                                             release followed by
10952                                                             load acquire from
10953                                                             completing out of
10954                                                             order. The s_waitcnt
10955                                                             could be placed after
10956                                                             seq_store or before
10957                                                             the seq_load. We
10958                                                             choose the load to
10959                                                             make the s_waitcnt be
10960                                                             as late as possible
10961                                                             so that the store
10962                                                             may have already
10963                                                             completed.)
10964
10965                                                         2. *Following
10966                                                            instructions same as
10967                                                            corresponding load
10968                                                            atomic acquire,
10969                                                            except must generate
10970                                                            all instructions even
10971                                                            for OpenCL.*
10972     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
10973                                                         local address space cannot
10974                                                         be used.*
10975
10976                                                         *Same as corresponding
10977                                                         load atomic acquire,
10978                                                         except must generate
10979                                                         all instructions even
10980                                                         for OpenCL.*
10981
10982     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
10983                               - system       - generic     vmcnt(0)
10984
10985                                                           - If TgSplit execution mode,
10986                                                             omit lgkmcnt(0).
10987                                                           - Could be split into
10988                                                             separate s_waitcnt
10989                                                             vmcnt(0)
10990                                                             and s_waitcnt
10991                                                             lgkmcnt(0) to allow
10992                                                             them to be
10993                                                             independently moved
10994                                                             according to the
10995                                                             following rules.
10996                                                           - s_waitcnt lgkmcnt(0)
10997                                                             must happen after
10998                                                             preceding
10999                                                             global/generic load
11000                                                             atomic/store
11001                                                             atomic/atomicrmw
11002                                                             with memory
11003                                                             ordering of seq_cst
11004                                                             and with equal or
11005                                                             wider sync scope.
11006                                                             (Note that seq_cst
11007                                                             fences have their
11008                                                             own s_waitcnt
11009                                                             lgkmcnt(0) and so do
11010                                                             not need to be
11011                                                             considered.)
11012                                                           - s_waitcnt vmcnt(0)
11013                                                             must happen after
11014                                                             preceding
11015                                                             global/generic load
11016                                                             atomic/store
11017                                                             atomic/atomicrmw
11018                                                             with memory
11019                                                             ordering of seq_cst
11020                                                             and with equal or
11021                                                             wider sync scope.
11022                                                             (Note that seq_cst
11023                                                             fences have their
11024                                                             own s_waitcnt
11025                                                             vmcnt(0) and so do
11026                                                             not need to be
11027                                                             considered.)
11028                                                           - Ensures any
11029                                                             preceding
11030                                                             sequential
11031                                                             consistent global
11032                                                             memory instructions
11033                                                             have completed
11034                                                             before executing
11035                                                             this sequentially
11036                                                             consistent
11037                                                             instruction. This
11038                                                             prevents reordering
11039                                                             a seq_cst store
11040                                                             followed by a
11041                                                             seq_cst load. (Note
11042                                                             that seq_cst is
11043                                                             stronger than
11044                                                             acquire/release as
11045                                                             the reordering of
11046                                                             load acquire
11047                                                             followed by a store
11048                                                             release is
11049                                                             prevented by the
11050                                                             s_waitcnt of
11051                                                             the release, but
11052                                                             there is nothing
11053                                                             preventing a store
11054                                                             release followed by
11055                                                             load acquire from
11056                                                             completing out of
11057                                                             order. The s_waitcnt
11058                                                             could be placed after
11059                                                             seq_store or before
11060                                                             the seq_load. We
11061                                                             choose the load to
11062                                                             make the s_waitcnt be
11063                                                             as late as possible
11064                                                             so that the store
11065                                                             may have already
11066                                                             completed.)
11067
11068                                                         2. *Following
11069                                                            instructions same as
11070                                                            corresponding load
11071                                                            atomic acquire,
11072                                                            except must generate
11073                                                            all instructions even
11074                                                            for OpenCL.*
11075     store atomic seq_cst      - singlethread - global   *Same as corresponding
11076                               - wavefront    - local    store atomic release,
11077                               - workgroup    - generic  except must generate
11078                               - agent                   all instructions even
11079                               - system                  for OpenCL.*
11080     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
11081                               - wavefront    - local    atomicrmw acq_rel,
11082                               - workgroup    - generic  except must generate
11083                               - agent                   all instructions even
11084                               - system                  for OpenCL.*
11085     fence        seq_cst      - singlethread *none*     *Same as corresponding
11086                               - wavefront               fence acq_rel,
11087                               - workgroup               except must generate
11088                               - agent                   all instructions even
11089                               - system                  for OpenCL.*
11090     ============ ============ ============== ========== ================================
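
As a rough illustration of the ``load atomic seq_cst`` rows above, the sketch
below picks the counters that the leading ``s_waitcnt`` has to wait on. The
function name and the string encoding of scopes, address spaces and counters
are assumptions made for this example. It is not the backend's implementation,
and the table remains the reference.

```python
def seq_cst_load_leading_wait(scope, address_space, tgsplit):
    """Return the counters the leading s_waitcnt must wait on (sketch only)."""
    if scope in ("singlethread", "wavefront"):
        # Same as the corresponding load atomic acquire: no extra leading wait.
        return []
    if scope == "workgroup":
        if address_space == "local":
            if tgsplit:
                raise ValueError("local address space cannot be used in TgSplit mode")
            return []  # same as the corresponding load atomic acquire
        # global/generic: lgkmcnt(0) if not TgSplit, vmcnt(0) if TgSplit.
        return ["vmcnt(0)"] if tgsplit else ["lgkmcnt(0)"]
    if scope in ("agent", "system") and address_space in ("global", "generic"):
        # lgkmcnt(0) & vmcnt(0); in TgSplit mode lgkmcnt(0) is omitted.
        return ["vmcnt(0)"] if tgsplit else ["lgkmcnt(0)", "vmcnt(0)"]
    raise ValueError("combination not covered by the rows sketched here")


print(seq_cst_load_leading_wait("agent", "global", tgsplit=False))
```

The remaining instructions of each sequence follow the corresponding load
atomic acquire row and are not modeled here.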

.. _amdgpu-amdhsa-memory-model-gfx10:

Memory Model GFX10
++++++++++++++++++

For GFX10:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront-wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. An ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront-wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. An ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. An ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 a memory attached last level (MALL) cache exists for GPU memory.
  The MALL cache is fully coherent with GPU memory and has no impact on system
  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
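
The cache hierarchy described in the list above can be summarized with a small
sketch that reports the nearest vector cache level two wavefronts have in
common. The ``Location`` type, its integer indices and the function name are
hypothetical and exist only for this illustration.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    """Where a wavefront runs; all indices are hypothetical identifiers."""
    agent: int
    sa: int   # shader array within the agent
    wgp: int  # work-group processor within the SA
    cu: int   # compute unit within the WGP


def closest_shared_vector_cache(a: Location, b: Location) -> Optional[str]:
    """Return the nearest cache level both wavefronts access, or None."""
    if a.agent != b.agent:
        return None  # L0, L1 and L2 all belong to a single agent
    if a.sa != b.sa:
        return "L2"  # L2 is shared by all SAs on the same agent
    if (a.wgp, a.cu) != (b.wgp, b.cu):
        return "L1"  # L1 is shared by all WGPs on the same SA
    return "L0"      # one vector L0 per CU, shared by its SIMDs


same_wgp_other_cu = closest_shared_vector_cache(
    Location(0, 0, 0, 0), Location(0, 0, 0, 1))
print(same_wgp_other_cu)
```

Two wavefronts of one work-group that run on different CUs of the same WGP
only meet at L1, which is why a ``buffer_gl0_inv`` is needed between them.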

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then an ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for an
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
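
The write-back rule for scalar spills can be pictured with the following
sketch, which operates on a plain list of mnemonic strings. The list
representation, the function name and ``RETURN_MARKER`` are assumptions made
for this example and do not reflect how the backend represents machine code.

```python
RETURN_MARKER = "<function return>"  # hypothetical placeholder, not a real mnemonic


def insert_scalar_writeback(instructions, uses_scalar_writes):
    """Return a copy with s_dcache_wb placed before each program end or return."""
    if not uses_scalar_writes:
        return list(instructions)  # nothing to write back
    result = []
    for instruction in instructions:
        if instruction in ("s_endpgm", RETURN_MARKER):
            result.append("s_dcache_wb")
        result.append(instruction)
    return result


print(insert_scalar_writeback(["s_nop 0", "s_endpgm"], uses_scalar_writes=True))
```

No ``s_dcache_inv`` is modeled because, as stated above, scalar writes are
always write-before-read in the same thread.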

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.
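
The MTYPE choices for kernarg and scratch backing memory amount to a small
lookup, sketched below. The function name and the ``kind`` and ``is_apu``
parameters are hypothetical and only restate the text above.

```python
def backing_memory_mtype(kind, is_apu=False):
    """Return the MTYPE used for a kind of backing memory (sketch only)."""
    if kind == "scratch":
        return "NC"  # non-coherent: private, single thread, write-before-read
    if kind == "kernarg":
        # APU: cache coherent with the CPU; dGPU: uncached to skip L2 invalidates.
        return "CC" if is_apu else "UC"
    raise ValueError("unknown backing memory kind: " + repr(kind))


print(backing_memory_mtype("kernarg", is_apu=True))
```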

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and vscnt reports completion of
store and atomic without return instructions in order. See the ``MEM_ORDERED``
field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
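
The split between the two counters in native mode can be sketched as a simple
classification. The operation names and both function names are assumptions
chosen for this example.

```python
VMCNT_OPS = frozenset({"load", "atomic_with_return", "sample"})
VSCNT_OPS = frozenset({"store", "atomic_without_return"})


def completion_counter(op):
    """Return the counter that reports completion of one vector memory operation."""
    if op in VMCNT_OPS:
        return "vmcnt"
    if op in VSCNT_OPS:
        return "vscnt"
    raise ValueError("unknown operation kind: " + repr(op))


def counters_to_wait_on(ops):
    """Return the sorted counters that must reach zero for all ops to complete."""
    return sorted({completion_counter(op) for op in ops})


print(counters_to_wait_on(["load", "store"]))
```

A sequence that contains both loads and stores therefore has to wait on both
counters, which matches the ``s_waitcnt vmcnt(0) & vscnt(0)`` notation used in
this section.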

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the
  per-CU L0 caches is required for work-group synchronization. Also, accesses to
  L1 at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group use the same L0, which in turn ensures L1 accesses are
  ordered, so no explicit management of the caches is required for
  work-group synchronization.

See the ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.
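
Read together with the cache description earlier in this section, the two
execution modes suggest the following sketch of which cache invalidates an
acquire needs at each sync scope. This is an interpretation of the prose with
hypothetical names, not the backend's logic. The code sequence table is the
reference for the exact instructions.

```python
def acquire_invalidates(scope, wgp_mode):
    """Return the cache invalidates an acquire needs at a scope (sketch only)."""
    if scope in ("singlethread", "wavefront"):
        return []  # a single wavefront always uses one L0
    if scope == "workgroup":
        # WGP mode: wavefronts may run on both CUs, so L0 must be managed.
        # CU mode: the work-group uses a single L0, so nothing is needed.
        return ["buffer_gl0_inv"] if wgp_mode else []
    if scope in ("agent", "system"):
        # Other work-groups may sit behind a different L0 and a different L1.
        return ["buffer_gl0_inv", "buffer_gl1_inv"]
    raise ValueError("unknown sync scope: " + repr(scope))


print(acquire_invalidates("workgroup", wgp_mode=True))
```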

The code sequences used to implement the memory model for GFX10 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.
11227
11228  .. table:: AMDHSA Memory Model Code Sequences GFX10
11229     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table
11230
11231     ============ ============ ============== ========== ================================
11232     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
11233                  Ordering     Sync Scope     Address    GFX10
11234                                              Space
11235     ============ ============ ============== ========== ================================
11236     **Non-Atomic**
11237     ------------------------------------------------------------------------------------
11238     load         *none*       *none*         - global   - !volatile & !nontemporal
11239                                              - generic
11240                                              - private    1. buffer/global/flat_load
11241                                              - constant
11242                                                         - !volatile & nontemporal
11243
11244                                                           1. buffer/global/flat_load
11245                                                              slc=1
11246
11247                                                         - volatile
11248
11249                                                           1. buffer/global/flat_load
11250                                                              glc=1 dlc=1
11251                                                           2. s_waitcnt vmcnt(0)
11252
11253                                                            - Must happen before
11254                                                              any following volatile
11255                                                              global/generic
11256                                                              load/store.
11257                                                            - Ensures that
11258                                                              volatile
11259                                                              operations to
11260                                                              different
11261                                                              addresses will not
11262                                                              be reordered by
11263                                                              hardware.
11264
11265     load         *none*       *none*         - local    1. ds_load
11266     store        *none*       *none*         - global   - !volatile & !nontemporal
11267                                              - generic
11268                                              - private    1. buffer/global/flat_store
11269                                              - constant
11270                                                         - !volatile & nontemporal
11271
11272                                                           1. buffer/global/flat_store
11273                                                              glc=1 slc=1
11274
11275                                                         - volatile
11276
11277                                                           1. buffer/global/flat_store
11278                                                           2. s_waitcnt vscnt(0)
11279
11280                                                            - Must happen before
11281                                                              any following volatile
11282                                                              global/generic
11283                                                              load/store.
11284                                                            - Ensures that
11285                                                              volatile
11286                                                              operations to
11287                                                              different
11288                                                              addresses will not
11289                                                              be reordered by
11290                                                              hardware.
11291
11292     store        *none*       *none*         - local    1. ds_store
11293     **Unordered Atomic**
11294     ------------------------------------------------------------------------------------
11295     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
11296     store atomic unordered    *any*          *any*      *Same as non-atomic*.
11297     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
11298     **Monotonic Atomic**
11299     ------------------------------------------------------------------------------------
11300     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
11301                               - wavefront    - generic
11302     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
11303                                              - generic     glc=1
11304
11305                                                           - If CU wavefront execution
11306                                                             mode, omit glc=1.
11307
11308     load atomic  monotonic    - singlethread - local    1. ds_load
11309                               - wavefront
11310                               - workgroup
11311     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
11312                               - system       - generic     glc=1 dlc=1
11313     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
11314                               - wavefront    - generic
11315                               - workgroup
11316                               - agent
11317                               - system
11318     store atomic monotonic    - singlethread - local    1. ds_store
11319                               - wavefront
11320                               - workgroup
11321     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
11322                               - wavefront    - generic
11323                               - workgroup
11324                               - agent
11325                               - system
11326     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
11327                               - wavefront
11328                               - workgroup
11329     **Acquire Atomic**
11330     ------------------------------------------------------------------------------------
11331     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
11332                               - wavefront    - local
11333                                              - generic
11334     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
11335
11336                                                           - If CU wavefront execution
11337                                                             mode, omit glc=1.
11338
11339                                                         2. s_waitcnt vmcnt(0)
11340
11341                                                           - If CU wavefront execution
11342                                                             mode, omit.
11343                                                           - Must happen before
11344                                                             the following buffer_gl0_inv
11345                                                             and before any following
11346                                                             global/generic
11347                                                             load/load
11348                                                             atomic/store/store
11349                                                             atomic/atomicrmw.
11350
11351                                                         3. buffer_gl0_inv
11352
11353                                                           - If CU wavefront execution
11354                                                             mode, omit.
11355                                                           - Ensures that
11356                                                             following
11357                                                             loads will not see
11358                                                             stale data.
11359
11360     load atomic  acquire      - workgroup    - local    1. ds_load
11361                                                         2. s_waitcnt lgkmcnt(0)
11362
11363                                                           - If OpenCL, omit.
11364                                                           - Must happen before
11365                                                             the following buffer_gl0_inv
11366                                                             and before any following
11367                                                             global/generic load/load
11368                                                             atomic/store/store
11369                                                             atomic/atomicrmw.
11370                                                           - Ensures any
11371                                                             following global
11372                                                             data read is no
11373                                                             older than the local load
11374                                                             atomic value being
11375                                                             acquired.
11376
11377                                                         3. buffer_gl0_inv
11378
11379                                                           - If CU wavefront execution
11380                                                             mode, omit.
11381                                                           - If OpenCL, omit.
11382                                                           - Ensures that
11383                                                             following
11384                                                             loads will not see
11385                                                             stale data.
11386
11387     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
11388
11389                                                           - If CU wavefront execution
11390                                                             mode, omit glc=1.
11391
11392                                                         2. s_waitcnt lgkmcnt(0) &
11393                                                            vmcnt(0)
11394
11395                                                           - If CU wavefront execution
11396                                                             mode, omit vmcnt(0).
11397                                                           - If OpenCL, omit
11398                                                             lgkmcnt(0).
11399                                                           - Must happen before
11400                                                             the following
11401                                                             buffer_gl0_inv and any
11402                                                             following global/generic
11403                                                             load/load
11404                                                             atomic/store/store
11405                                                             atomic/atomicrmw.
11406                                                           - Ensures any
11407                                                             following global
11408                                                             data read is no
11409                                                             older than a local load
11410                                                             atomic value being
11411                                                             acquired.
11412
11413                                                         3. buffer_gl0_inv
11414
11415                                                           - If CU wavefront execution
11416                                                             mode, omit.
11417                                                           - Ensures that
11418                                                             following
11419                                                             loads will not see
11420                                                             stale data.
11421
11422     load atomic  acquire      - agent        - global   1. buffer/global_load
11423                               - system                     glc=1 dlc=1
11424                                                         2. s_waitcnt vmcnt(0)
11425
11426                                                           - Must happen before
11427                                                             following
11428                                                             buffer_gl*_inv.
11429                                                           - Ensures the load
11430                                                             has completed
11431                                                             before invalidating
11432                                                             the caches.
11433
11434                                                         3. buffer_gl0_inv;
11435                                                            buffer_gl1_inv
11436
11437                                                           - Must happen before
11438                                                             any following
11439                                                             global/generic
11440                                                             load/load
11441                                                             atomic/atomicrmw.
11442                                                           - Ensures that
11443                                                             following
11444                                                             loads will not see
11445                                                             stale global data.
11446
11447     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
11448                               - system                  2. s_waitcnt vmcnt(0) &
11449                                                            lgkmcnt(0)
11450
11451                                                           - If OpenCL, omit
11452                                                             lgkmcnt(0).
11453                                                           - Must happen before
11454                                                             following
11455                                                             buffer_gl*_inv.
11456                                                           - Ensures the flat_load
11457                                                             has completed
11458                                                             before invalidating
11459                                                             the caches.
11460
11461                                                         3. buffer_gl0_inv;
11462                                                            buffer_gl1_inv
11463
11464                                                           - Must happen before
11465                                                             any following
11466                                                             global/generic
11467                                                             load/load
11468                                                             atomic/atomicrmw.
11469                                                           - Ensures that
11470                                                             following loads
11471                                                             will not see stale
11472                                                             global data.
11473
11474     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
11475                               - wavefront    - local
11476                                              - generic
11477     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
11478                                                         2. s_waitcnt vm/vscnt(0)
11479
11480                                                           - If CU wavefront execution
11481                                                             mode, omit.
11482                                                           - Use vmcnt(0) if atomic with
11483                                                             return and vscnt(0) if
11484                                                             atomic with no-return.
11485                                                           - Must happen before
11486                                                             the following buffer_gl0_inv
11487                                                             and before any following
11488                                                             global/generic
11489                                                             load/load
11490                                                             atomic/store/store
11491                                                             atomic/atomicrmw.
11492
11493                                                         3. buffer_gl0_inv
11494
11495                                                           - If CU wavefront execution
11496                                                             mode, omit.
11497                                                           - Ensures that
11498                                                             following
11499                                                             loads will not see
11500                                                             stale data.
11501
11502     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
11503                                                         2. s_waitcnt lgkmcnt(0)
11504
11505                                                           - If OpenCL, omit.
11506                                                           - Must happen before
11507                                                             the following
11508                                                             buffer_gl0_inv.
11509                                                           - Ensures any
11510                                                             following global
11511                                                             data read is no
11512                                                             older than the local
11513                                                             atomicrmw value
11514                                                             being acquired.
11515
11516                                                         3. buffer_gl0_inv
11517
11518                                                           - If OpenCL, omit.
11519                                                           - Ensures that
11520                                                             following
11521                                                             loads will not see
11522                                                             stale data.
11523
11524     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
11525                                                         2. s_waitcnt lgkmcnt(0) &
11526                                                            vm/vscnt(0)
11527
11528                                                           - If CU wavefront execution
11529                                                             mode, omit vm/vscnt(0).
11530                                                           - If OpenCL, omit lgkmcnt(0).
11531                                                           - Use vmcnt(0) if atomic with
11532                                                             return and vscnt(0) if
11533                                                             atomic with no-return.
11534                                                           - Must happen before
11535                                                             the following
11536                                                             buffer_gl0_inv.
11537                                                           - Ensures any
11538                                                             following global
11539                                                             data read is no
11540                                                             older than a local
11541                                                             atomicrmw value
11542                                                             being acquired.
11543
11544                                                         3. buffer_gl0_inv
11545
11546                                                           - If CU wavefront execution
11547                                                             mode, omit.
11548                                                           - Ensures that
11549                                                             following
11550                                                             loads will not see
11551                                                             stale data.
11552
11553     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
11554                               - system                  2. s_waitcnt vm/vscnt(0)
11555
11556                                                           - Use vmcnt(0) if atomic with
11557                                                             return and vscnt(0) if
11558                                                             atomic with no-return.
11559                                                           - Must happen before
11560                                                             following
11561                                                             buffer_gl*_inv.
11562                                                           - Ensures the
11563                                                             atomicrmw has
11564                                                             completed before
11565                                                             invalidating the
11566                                                             caches.
11567
11568                                                         3. buffer_gl0_inv;
11569                                                            buffer_gl1_inv
11570
11571                                                           - Must happen before
11572                                                             any following
11573                                                             global/generic
11574                                                             load/load
11575                                                             atomic/atomicrmw.
11576                                                           - Ensures that
11577                                                             following loads
11578                                                             will not see stale
11579                                                             global data.
11580
11581     atomicrmw    acquire      - agent        - generic  1. flat_atomic
11582                               - system                  2. s_waitcnt vm/vscnt(0) &
11583                                                            lgkmcnt(0)
11584
11585                                                           - If OpenCL, omit
11586                                                             lgkmcnt(0).
11587                                                           - Use vmcnt(0) if atomic with
11588                                                             return and vscnt(0) if
11589                                                             atomic with no-return.
11590                                                           - Must happen before
11591                                                             following
11592                                                             buffer_gl*_inv.
11593                                                           - Ensures the
11594                                                             atomicrmw has
11595                                                             completed before
11596                                                             invalidating the
11597                                                             caches.
11598
11599                                                         3. buffer_gl0_inv;
11600                                                            buffer_gl1_inv
11601
11602                                                           - Must happen before
11603                                                             any following
11604                                                             global/generic
11605                                                             load/load
11606                                                             atomic/atomicrmw.
11607                                                           - Ensures that
11608                                                             following loads
11609                                                             will not see stale
11610                                                             global data.
11611
11612     fence        acquire      - singlethread *none*     *none*
11613                               - wavefront
11614     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
11615                                                            vmcnt(0) & vscnt(0)
11616
11617                                                           - If CU wavefront execution
11618                                                             mode, omit vmcnt(0) and
11619                                                             vscnt(0).
11620                                                           - If OpenCL and
11621                                                             address space is
11622                                                             not generic, omit
11623                                                             lgkmcnt(0).
11624                                                           - If OpenCL and
11625                                                             address space is
11626                                                             local, omit
11627                                                             vmcnt(0) and vscnt(0).
11628                                                           - However, since LLVM
11629                                                             currently has no
11630                                                             address space on
11631                                                             the fence, the waits
11632                                                             must conservatively
11633                                                             always be generated.
11634                                                             If the fence had an
11635                                                             address space, it
11636                                                             would be set to the
11637                                                             address space of the
11638                                                             OpenCL fence flag,
11639                                                             or to generic if
11640                                                             both local and
11641                                                             global flags are
11642                                                             specified.
11643                                                           - Could be split into
11644                                                             separate s_waitcnt
11645                                                             vmcnt(0), s_waitcnt
11646                                                             vscnt(0) and s_waitcnt
11647                                                             lgkmcnt(0) to allow
11648                                                             them to be
11649                                                             independently moved
11650                                                             according to the
11651                                                             following rules.
11652                                                           - s_waitcnt vmcnt(0)
11653                                                             must happen after
11654                                                             any preceding
11655                                                             global/generic load
11656                                                             atomic/
11657                                                             atomicrmw-with-return-value
11658                                                             with an equal or
11659                                                             wider sync scope
11660                                                             and memory ordering
11661                                                             stronger than
11662                                                             unordered (this is
11663                                                             termed the
11664                                                             fence-paired-atomic).
11665                                                           - s_waitcnt vscnt(0)
11666                                                             must happen after
11667                                                             any preceding
11668                                                             global/generic
11669                                                             atomicrmw-no-return-value
11670                                                             with an equal or
11671                                                             wider sync scope
11672                                                             and memory ordering
11673                                                             stronger than
11674                                                             unordered (this is
11675                                                             termed the
11676                                                             fence-paired-atomic).
11677                                                           - s_waitcnt lgkmcnt(0)
11678                                                             must happen after
11679                                                             any preceding
11680                                                             local/generic load
11681                                                             atomic/atomicrmw
11682                                                             with an equal or
11683                                                             wider sync scope
11684                                                             and memory ordering
11685                                                             stronger than
11686                                                             unordered (this is
11687                                                             termed the
11688                                                             fence-paired-atomic).
11689                                                           - Must happen before
11690                                                             the following
11691                                                             buffer_gl0_inv.
11692                                                           - Ensures that the
11693                                                             fence-paired-atomic
11694                                                             has completed
11695                                                             before invalidating
11696                                                             the
11697                                                             cache. Therefore
11698                                                             any following
11699                                                             locations read must
11700                                                             be no older than
11701                                                             the value read by
11702                                                             the
11703                                                             fence-paired-atomic.
11704
11705                                                         2. buffer_gl0_inv
11706
11707                                                           - If CU wavefront execution
11708                                                             mode, omit.
11709                                                           - Ensures that
11710                                                             following
11711                                                             loads will not see
11712                                                             stale data.
11713
11714     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
11715                               - system                     vmcnt(0) & vscnt(0)
11716
11717                                                           - If OpenCL and
11718                                                             address space is
11719                                                             not generic, omit
11720                                                             lgkmcnt(0).
11721                                                           - If OpenCL and
11722                                                             address space is
11723                                                             local, omit
11724                                                             vmcnt(0) and vscnt(0).
11725                                                           - However, since LLVM
11726                                                             currently has no
11727                                                             address space on
11728                                                             the fence, the waits
11729                                                             must conservatively
11730                                                             always be generated
11731                                                             (see comment for
11732                                                             previous fence).
11733                                                           - Could be split into
11734                                                             separate s_waitcnt
11735                                                             vmcnt(0), s_waitcnt
11736                                                             vscnt(0) and s_waitcnt
11737                                                             lgkmcnt(0) to allow
11738                                                             them to be
11739                                                             independently moved
11740                                                             according to the
11741                                                             following rules.
11742                                                           - s_waitcnt vmcnt(0)
11743                                                             must happen after
11744                                                             any preceding
11745                                                             global/generic load
11746                                                             atomic/
11747                                                             atomicrmw-with-return-value
11748                                                             with an equal or
11749                                                             wider sync scope
11750                                                             and memory ordering
11751                                                             stronger than
11752                                                             unordered (this is
11753                                                             termed the
11754                                                             fence-paired-atomic).
11755                                                           - s_waitcnt vscnt(0)
11756                                                             must happen after
11757                                                             any preceding
11758                                                             global/generic
11759                                                             atomicrmw-no-return-value
11760                                                             with an equal or
11761                                                             wider sync scope
11762                                                             and memory ordering
11763                                                             stronger than
11764                                                             unordered (this is
11765                                                             termed the
11766                                                             fence-paired-atomic).
11767                                                           - s_waitcnt lgkmcnt(0)
11768                                                             must happen after
11769                                                             any preceding
11770                                                             local/generic load
11771                                                             atomic/atomicrmw
11772                                                             with an equal or
11773                                                             wider sync scope
11774                                                             and memory ordering
11775                                                             stronger than
11776                                                             unordered (this is
11777                                                             termed the
11778                                                             fence-paired-atomic).
11779                                                           - Must happen before
11780                                                             the following
11781                                                             buffer_gl*_inv.
11782                                                           - Ensures that the
11783                                                             fence-paired-atomic
11784                                                             has completed
11785                                                             before invalidating
11786                                                             the
11787                                                             caches. Therefore
11788                                                             any following
11789                                                             locations read must
11790                                                             be no older than
11791                                                             the value read by
11792                                                             the
11793                                                             fence-paired-atomic.
11794
11795                                                         2. buffer_gl0_inv;
11796                                                            buffer_gl1_inv
11797
11798                                                           - Must happen before any
11799                                                             following global/generic
11800                                                             load/load
11801                                                             atomic/store/store
11802                                                             atomic/atomicrmw.
11803                                                           - Ensures that
11804                                                             following loads
11805                                                             will not see stale
11806                                                             global data.
11807
11808     **Release Atomic**
11809     ------------------------------------------------------------------------------------
11810     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
11811                               - wavefront    - local
11812                                              - generic
11813     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
11814                                              - generic     vmcnt(0) & vscnt(0)
11815
11816                                                           - If CU wavefront execution
11817                                                             mode, omit vmcnt(0) and
11818                                                             vscnt(0).
11819                                                           - If OpenCL, omit
11820                                                             lgkmcnt(0).
11821                                                           - Could be split into
11822                                                             separate s_waitcnt
11823                                                             vmcnt(0), s_waitcnt
11824                                                             vscnt(0) and s_waitcnt
11825                                                             lgkmcnt(0) to allow
11826                                                             them to be
11827                                                             independently moved
11828                                                             according to the
11829                                                             following rules.
11830                                                           - s_waitcnt vmcnt(0)
11831                                                             must happen after
11832                                                             any preceding
11833                                                             global/generic load/load
11834                                                             atomic/
11835                                                             atomicrmw-with-return-value.
11836                                                           - s_waitcnt vscnt(0)
11837                                                             must happen after
11838                                                             any preceding
11839                                                             global/generic
11840                                                             store/store
11841                                                             atomic/
11842                                                             atomicrmw-no-return-value.
11843                                                           - s_waitcnt lgkmcnt(0)
11844                                                             must happen after
11845                                                             any preceding
11846                                                             local/generic
11847                                                             load/store/load
11848                                                             atomic/store
11849                                                             atomic/atomicrmw.
11850                                                           - Must happen before
11851                                                             the following
11852                                                             store.
11853                                                           - Ensures that all
11854                                                             memory operations
11855                                                             have
11856                                                             completed before
11857                                                             performing the
11858                                                             store that is being
11859                                                             released.
11860
11861                                                         2. buffer/global/flat_store
11862     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
11863
11864                                                           - If CU wavefront execution
11865                                                             mode, omit.
11866                                                           - If OpenCL, omit.
11867                                                           - Could be split into
11868                                                             separate s_waitcnt
11869                                                             vmcnt(0) and s_waitcnt
11870                                                             vscnt(0) to allow
11871                                                             them to be
11872                                                             independently moved
11873                                                             according to the
11874                                                             following rules.
11875                                                           - s_waitcnt vmcnt(0)
11876                                                             must happen after
11877                                                             any preceding
11878                                                             global/generic load/load
11879                                                             atomic/
11880                                                             atomicrmw-with-return-value.
11881                                                           - s_waitcnt vscnt(0)
11882                                                             must happen after
11883                                                             any preceding
11884                                                             global/generic
11885                                                             store/store atomic/
11886                                                             atomicrmw-no-return-value.
11887                                                           - Must happen before
11888                                                             the following
11889                                                             store.
11890                                                           - Ensures that all
11891                                                             global memory
11892                                                             operations have
11893                                                             completed before
11894                                                             performing the
11895                                                             store that is being
11896                                                             released.
11897
11898                                                         2. ds_store
11899     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
11900                               - system       - generic     vmcnt(0) & vscnt(0)
11901
11902                                                           - If OpenCL and
11903                                                             address space is
11904                                                             not generic, omit
11905                                                             lgkmcnt(0).
11906                                                           - Could be split into
11907                                                             separate s_waitcnt
11908                                                             vmcnt(0), s_waitcnt vscnt(0)
11909                                                             and s_waitcnt
11910                                                             lgkmcnt(0) to allow
11911                                                             them to be
11912                                                             independently moved
11913                                                             according to the
11914                                                             following rules.
11915                                                           - s_waitcnt vmcnt(0)
11916                                                             must happen after
11917                                                             any preceding
11918                                                             global/generic
11919                                                             load/load
11920                                                             atomic/
11921                                                             atomicrmw-with-return-value.
11922                                                           - s_waitcnt vscnt(0)
11923                                                             must happen after
11924                                                             any preceding
11925                                                             global/generic
11926                                                             store/store atomic/
11927                                                             atomicrmw-no-return-value.
11928                                                           - s_waitcnt lgkmcnt(0)
11929                                                             must happen after
11930                                                             any preceding
11931                                                             local/generic
11932                                                             load/store/load
11933                                                             atomic/store
11934                                                             atomic/atomicrmw.
11935                                                           - Must happen before
11936                                                             the following
11937                                                             store.
11938                                                           - Ensures that all
11939                                                             memory operations
11940                                                             have
11941                                                             completed before
11942                                                             performing the
11943                                                             store that is being
11944                                                             released.
11945
11946                                                         2. buffer/global/flat_store
11947     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
11948                               - wavefront    - local
11949                                              - generic
11950     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
11951                                              - generic     vmcnt(0) & vscnt(0)
11952
11953                                                           - If CU wavefront execution
11954                                                             mode, omit vmcnt(0) and
11955                                                             vscnt(0).
11956                                                           - If OpenCL, omit lgkmcnt(0).
11957                                                           - Could be split into
11958                                                             separate s_waitcnt
11959                                                             vmcnt(0), s_waitcnt
11960                                                             vscnt(0) and s_waitcnt
11961                                                             lgkmcnt(0) to allow
11962                                                             them to be
11963                                                             independently moved
11964                                                             according to the
11965                                                             following rules.
11966                                                           - s_waitcnt vmcnt(0)
11967                                                             must happen after
11968                                                             any preceding
11969                                                             global/generic load/load
11970                                                             atomic/
11971                                                             atomicrmw-with-return-value.
11972                                                           - s_waitcnt vscnt(0)
11973                                                             must happen after
11974                                                             any preceding
11975                                                             global/generic
11976                                                             store/store
11977                                                             atomic/
11978                                                             atomicrmw-no-return-value.
11979                                                           - s_waitcnt lgkmcnt(0)
11980                                                             must happen after
11981                                                             any preceding
11982                                                             local/generic
11983                                                             load/store/load
11984                                                             atomic/store
11985                                                             atomic/atomicrmw.
11986                                                           - Must happen before
11987                                                             the following
11988                                                             atomicrmw.
11989                                                           - Ensures that all
11990                                                             memory operations
11991                                                             have
11992                                                             completed before
11993                                                             performing the
11994                                                             atomicrmw that is
11995                                                             being released.
11996
11997                                                         2. buffer/global/flat_atomic
11998     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
11999
12000                                                           - If CU wavefront execution
12001                                                             mode, omit.
12002                                                           - If OpenCL, omit.
12003                                                           - Could be split into
12004                                                             separate s_waitcnt
12005                                                             vmcnt(0) and s_waitcnt
12006                                                             vscnt(0) to allow
12007                                                             them to be
12008                                                             independently moved
12009                                                             according to the
12010                                                             following rules.
12011                                                           - s_waitcnt vmcnt(0)
12012                                                             must happen after
12013                                                             any preceding
12014                                                             global/generic load/load
12015                                                             atomic/
12016                                                             atomicrmw-with-return-value.
12017                                                           - s_waitcnt vscnt(0)
12018                                                             must happen after
12019                                                             any preceding
12020                                                             global/generic
12021                                                             store/store atomic/
12022                                                             atomicrmw-no-return-value.
12023                                                           - Must happen before
12024                                                             the following
12025                                                             atomicrmw.
12026                                                           - Ensures that all
12027                                                             global memory
12028                                                             operations have
12029                                                             completed before
12030                                                             performing the
12031                                                             atomicrmw that is
12032                                                             being released.
12033
12034                                                         2. ds_atomic
12035     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12036                               - system       - generic     vmcnt(0) & vscnt(0)
12037
12038                                                           - If OpenCL, omit
12039                                                             lgkmcnt(0).
12040                                                           - Could be split into
12041                                                             separate s_waitcnt
12042                                                             vmcnt(0), s_waitcnt
12043                                                             vscnt(0) and s_waitcnt
12044                                                             lgkmcnt(0) to allow
12045                                                             them to be
12046                                                             independently moved
12047                                                             according to the
12048                                                             following rules.
12049                                                           - s_waitcnt vmcnt(0)
12050                                                             must happen after
12051                                                             any preceding
12052                                                             global/generic
12053                                                             load/load atomic/
12054                                                             atomicrmw-with-return-value.
12055                                                           - s_waitcnt vscnt(0)
12056                                                             must happen after
12057                                                             any preceding
12058                                                             global/generic
12059                                                             store/store atomic/
12060                                                             atomicrmw-no-return-value.
12061                                                           - s_waitcnt lgkmcnt(0)
12062                                                             must happen after
12063                                                             any preceding
12064                                                             local/generic
12065                                                             load/store/load
12066                                                             atomic/store
12067                                                             atomic/atomicrmw.
12068                                                           - Must happen before
12069                                                             the following
12070                                                             atomicrmw.
12071                                                           - Ensures that all
12072                                                             memory operations
12073                                                             to global and local
12074                                                             have completed
12075                                                             before performing
12076                                                             the atomicrmw that
12077                                                             is being released.
12078
12079                                                         2. buffer/global/flat_atomic
12080     fence        release      - singlethread *none*     *none*
12081                               - wavefront
12082     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12083                                                            vmcnt(0) & vscnt(0)
12084
12085                                                           - If CU wavefront execution
12086                                                             mode, omit vmcnt(0) and
12087                                                             vscnt(0).
12088                                                           - If OpenCL and
12089                                                             address space is
12090                                                             not generic, omit
12091                                                             lgkmcnt(0).
12092                                                           - If OpenCL and
12093                                                             address space is
12094                                                             local, omit
12095                                                             vmcnt(0) and vscnt(0).
12096                                                           - However, since LLVM
12097                                                             currently has no
12098                                                             address space on
12099                                                             the fence, these waits
12100                                                             must conservatively
12101                                                             always be generated.
12102                                                             If the fence had an
12103                                                             address space, it
12104                                                             would be set to the
12105                                                             address space of the
12106                                                             OpenCL fence flag, or
12107                                                             to generic if both
12108                                                             local and global
12109                                                             flags are
12110                                                             specified.
12111                                                           - Could be split into
12112                                                             separate s_waitcnt
12113                                                             vmcnt(0), s_waitcnt
12114                                                             vscnt(0) and s_waitcnt
12115                                                             lgkmcnt(0) to allow
12116                                                             them to be
12117                                                             independently moved
12118                                                             according to the
12119                                                             following rules.
12120                                                           - s_waitcnt vmcnt(0)
12121                                                             must happen after
12122                                                             any preceding
12123                                                             global/generic
12124                                                             load/load
12125                                                             atomic/
12126                                                             atomicrmw-with-return-value.
12127                                                           - s_waitcnt vscnt(0)
12128                                                             must happen after
12129                                                             any preceding
12130                                                             global/generic
12131                                                             store/store atomic/
12132                                                             atomicrmw-no-return-value.
12133                                                           - s_waitcnt lgkmcnt(0)
12134                                                             must happen after
12135                                                             any preceding
12136                                                             local/generic
12137                                                             load/store/load
12138                                                             atomic/store atomic/
12139                                                             atomicrmw.
12140                                                           - Must happen before
12141                                                             any following store
12142                                                             atomic/atomicrmw
12143                                                             with an equal or
12144                                                             wider sync scope
12145                                                             and memory ordering
12146                                                             stronger than
12147                                                             unordered (this is
12148                                                             termed the
12149                                                             fence-paired-atomic).
12150                                                           - Ensures that all
12151                                                             memory operations
12152                                                             have
12153                                                             completed before
12154                                                             performing the
12155                                                             following
12156                                                             fence-paired-atomic.
12157
12158     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12159                               - system                     vmcnt(0) & vscnt(0)
12160
12161                                                           - If OpenCL and
12162                                                             address space is
12163                                                             not generic, omit
12164                                                             lgkmcnt(0).
12165                                                           - If OpenCL and
12166                                                             address space is
12167                                                             local, omit
12168                                                             vmcnt(0) and vscnt(0).
12169                                                           - However, since LLVM
12170                                                             currently has no
12171                                                             address space on
12172                                                             the fence, the waits
12173                                                             must conservatively
12174                                                             always be generated.
12175                                                             If the fence had an
12176                                                             address space, it
12177                                                             would be set to the
12178                                                             address space of the
12179                                                             OpenCL fence flag,
12180                                                             or to generic if
12181                                                             both local and
12182                                                             global flags are
12183                                                             specified.
12184                                                           - Could be split into
12185                                                             separate s_waitcnt
12186                                                             vmcnt(0), s_waitcnt
12187                                                             vscnt(0) and s_waitcnt
12188                                                             lgkmcnt(0) to allow
12189                                                             them to be
12190                                                             independently moved
12191                                                             according to the
12192                                                             following rules.
12193                                                           - s_waitcnt vmcnt(0)
12194                                                             must happen after
12195                                                             any preceding
12196                                                             global/generic
12197                                                             load/load atomic/
12198                                                             atomicrmw-with-return-value.
12199                                                           - s_waitcnt vscnt(0)
12200                                                             must happen after
12201                                                             any preceding
12202                                                             global/generic
12203                                                             store/store atomic/
12204                                                             atomicrmw-no-return-value.
12205                                                           - s_waitcnt lgkmcnt(0)
12206                                                             must happen after
12207                                                             any preceding
12208                                                             local/generic
12209                                                             load/store/load
12210                                                             atomic/store
12211                                                             atomic/atomicrmw.
12212                                                           - Must happen before
12213                                                             any following store
12214                                                             atomic/atomicrmw
12215                                                             with an equal or
12216                                                             wider sync scope
12217                                                             and memory ordering
12218                                                             stronger than
12219                                                             unordered (this is
12220                                                             termed the
12221                                                             fence-paired-atomic).
12222                                                           - Ensures that all
12223                                                             memory operations
12224                                                             have
12225                                                             completed before
12226                                                             performing the
12227                                                             following
12228                                                             fence-paired-atomic.
12229
12230     **Acquire-Release Atomic**
12231     ------------------------------------------------------------------------------------
12232     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
12233                               - wavefront    - local
12234                                              - generic
12235     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12236                                                            vmcnt(0) & vscnt(0)
12237
12238                                                           - If CU wavefront execution
12239                                                             mode, omit vmcnt(0) and
12240                                                             vscnt(0).
12241                                                           - If OpenCL, omit
12242                                                             lgkmcnt(0).
12243                                                           - Must happen after
12244                                                             any preceding
12245                                                             local/generic
12246                                                             load/store/load
12247                                                             atomic/store
12248                                                             atomic/atomicrmw.
12249                                                           - Could be split into
12250                                                             separate s_waitcnt
12251                                                             vmcnt(0), s_waitcnt
12252                                                             vscnt(0), and s_waitcnt
12253                                                             lgkmcnt(0) to allow
12254                                                             them to be
12255                                                             independently moved
12256                                                             according to the
12257                                                             following rules.
12258                                                           - s_waitcnt vmcnt(0)
12259                                                             must happen after
12260                                                             any preceding
12261                                                             global/generic load/load
12262                                                             atomic/
12263                                                             atomicrmw-with-return-value.
12264                                                           - s_waitcnt vscnt(0)
12265                                                             must happen after
12266                                                             any preceding
12267                                                             global/generic
12268                                                             store/store
12269                                                             atomic/
12270                                                             atomicrmw-no-return-value.
12271                                                           - s_waitcnt lgkmcnt(0)
12272                                                             must happen after
12273                                                             any preceding
12274                                                             local/generic
12275                                                             load/store/load
12276                                                             atomic/store
12277                                                             atomic/atomicrmw.
12278                                                           - Must happen before
12279                                                             the following
12280                                                             atomicrmw.
12281                                                           - Ensures that all
12282                                                             memory operations
12283                                                             have
12284                                                             completed before
12285                                                             performing the
12286                                                             atomicrmw that is
12287                                                             being released.
12288
12289                                                         2. buffer/global_atomic
12290                                                         3. s_waitcnt vm/vscnt(0)
12291
12292                                                           - If CU wavefront execution
12293                                                             mode, omit.
12294                                                           - Use vmcnt(0) if atomic with
12295                                                             return and vscnt(0) if
12296                                                             atomic with no-return.
12297                                                           - Must happen before
12298                                                             the following
12299                                                             buffer_gl0_inv.
12300                                                           - Ensures any
12301                                                             following global
12302                                                             data read is no
12303                                                             older than the
12304                                                             atomicrmw value
12305                                                             being acquired.
12306
12307                                                         4. buffer_gl0_inv
12308
12309                                                           - If CU wavefront execution
12310                                                             mode, omit.
12311                                                           - Ensures that
12312                                                             following
12313                                                             loads will not see
12314                                                             stale data.
12315
12316     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12317
12318                                                           - If CU wavefront execution
12319                                                             mode, omit.
12320                                                           - If OpenCL, omit.
12321                                                           - Could be split into
12322                                                             separate s_waitcnt
12323                                                             vmcnt(0) and s_waitcnt
12324                                                             vscnt(0) to allow
12325                                                             them to be
12326                                                             independently moved
12327                                                             according to the
12328                                                             following rules.
12329                                                           - s_waitcnt vmcnt(0)
12330                                                             must happen after
12331                                                             any preceding
12332                                                             global/generic load/load
12333                                                             atomic/
12334                                                             atomicrmw-with-return-value.
12335                                                           - s_waitcnt vscnt(0)
12336                                                             must happen after
12337                                                             any preceding
12338                                                             global/generic
12339                                                             store/store atomic/
12340                                                             atomicrmw-no-return-value.
12341                                                           - Must happen before
12342                                                             the following
12343                                                             atomicrmw.
12344                                                           - Ensures that all
12345                                                             global memory
12346                                                             operations have
12347                                                             completed before
12348                                                             performing the
12349                                                             atomicrmw that is
12350                                                             being released.
12351
12352                                                         2. ds_atomic
12353                                                         3. s_waitcnt lgkmcnt(0)
12354
12355                                                           - If OpenCL, omit.
12356                                                           - Must happen before
12357                                                             the following
12358                                                             buffer_gl0_inv.
12359                                                           - Ensures any
12360                                                             following global
12361                                                             data read is no
12362                                                             older than the local
12363                                                             atomicrmw value being
12364                                                             acquired.
12365
12366                                                         4. buffer_gl0_inv
12367
12368                                                           - If CU wavefront execution
12369                                                             mode, omit.
12370                                                           - If OpenCL, omit.
12371                                                           - Ensures that
12372                                                             following
12373                                                             loads will not see
12374                                                             stale data.
12375
12376     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
12377                                                            vmcnt(0) & vscnt(0)
12378
12379                                                           - If CU wavefront execution
12380                                                             mode, omit vmcnt(0) and
12381                                                             vscnt(0).
12382                                                           - If OpenCL, omit lgkmcnt(0).
12383                                                           - Could be split into
12384                                                             separate s_waitcnt
12385                                                             vmcnt(0), s_waitcnt
12386                                                             vscnt(0) and s_waitcnt
12387                                                             lgkmcnt(0) to allow
12388                                                             them to be
12389                                                             independently moved
12390                                                             according to the
12391                                                             following rules.
12392                                                           - s_waitcnt vmcnt(0)
12393                                                             must happen after
12394                                                             any preceding
12395                                                             global/generic load/load
12396                                                             atomic/
12397                                                             atomicrmw-with-return-value.
12398                                                           - s_waitcnt vscnt(0)
12399                                                             must happen after
12400                                                             any preceding
12401                                                             global/generic
12402                                                             store/store
12403                                                             atomic/
12404                                                             atomicrmw-no-return-value.
12405                                                           - s_waitcnt lgkmcnt(0)
12406                                                             must happen after
12407                                                             any preceding
12408                                                             local/generic
12409                                                             load/store/load
12410                                                             atomic/store
12411                                                             atomic/atomicrmw.
12412                                                           - Must happen before
12413                                                             the following
12414                                                             atomicrmw.
12415                                                           - Ensures that all
12416                                                             memory operations
12417                                                             have
12418                                                             completed before
12419                                                             performing the
12420                                                             atomicrmw that is
12421                                                             being released.
12422
12423                                                         2. flat_atomic
12424                                                         3. s_waitcnt lgkmcnt(0) &
12425                                                            vmcnt(0) & vscnt(0)
12426
12427                                                           - If CU wavefront execution
12428                                                             mode, omit vmcnt(0) and
12429                                                             vscnt(0).
12430                                                           - If OpenCL, omit lgkmcnt(0).
12431                                                           - Must happen before
12432                                                             the following
12433                                                             buffer_gl0_inv.
12434                                                           - Ensures any
12435                                                             following global
12436                                                             data read is no
12437                                                             older than the
12438                                                             atomicrmw value being
12439                                                             acquired.
12440
12441                                                         4. buffer_gl0_inv
12442
12443                                                           - If CU wavefront execution
12444                                                             mode, omit.
12445                                                           - Ensures that
12446                                                             following
12447                                                             loads will not see
12448                                                             stale data.
12449
12450     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12451                               - system                     vmcnt(0) & vscnt(0)
12452
12453                                                           - If OpenCL, omit
12454                                                             lgkmcnt(0).
12455                                                           - Could be split into
12456                                                             separate s_waitcnt
12457                                                             vmcnt(0), s_waitcnt
12458                                                             vscnt(0) and s_waitcnt
12459                                                             lgkmcnt(0) to allow
12460                                                             them to be
12461                                                             independently moved
12462                                                             according to the
12463                                                             following rules.
12464                                                           - s_waitcnt vmcnt(0)
12465                                                             must happen after
12466                                                             any preceding
12467                                                             global/generic
12468                                                             load/load atomic/
12469                                                             atomicrmw-with-return-value.
12470                                                           - s_waitcnt vscnt(0)
12471                                                             must happen after
12472                                                             any preceding
12473                                                             global/generic
12474                                                             store/store atomic/
12475                                                             atomicrmw-no-return-value.
12476                                                           - s_waitcnt lgkmcnt(0)
12477                                                             must happen after
12478                                                             any preceding
12479                                                             local/generic
12480                                                             load/store/load
12481                                                             atomic/store
12482                                                             atomic/atomicrmw.
12483                                                           - Must happen before
12484                                                             the following
12485                                                             atomicrmw.
12486                                                           - Ensures that all
12487                                                             memory operations
12488                                                             to global have
12489                                                             completed before
12490                                                             performing the
12491                                                             atomicrmw that is
12492                                                             being released.
12493
12494                                                         2. buffer/global_atomic
12495                                                         3. s_waitcnt vm/vscnt(0)
12496
12497                                                           - Use vmcnt(0) if atomic with
12498                                                             return and vscnt(0) if
12499                                                             atomic with no-return.
12500                                                           - Must happen before
12501                                                             following
12502                                                             buffer_gl*_inv.
12503                                                           - Ensures the
12504                                                             atomicrmw has
12505                                                             completed before
12506                                                             invalidating the
12507                                                             caches.
12508
12509                                                         4. buffer_gl0_inv;
12510                                                            buffer_gl1_inv
12511
12512                                                           - Must happen before
12513                                                             any following
12514                                                             global/generic
12515                                                             load/load
12516                                                             atomic/atomicrmw.
12517                                                           - Ensures that
12518                                                             following loads
12519                                                             will not see stale
12520                                                             global data.
12521
12522     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
12523                               - system                     vmcnt(0) & vscnt(0)
12524
12525                                                           - If OpenCL, omit
12526                                                             lgkmcnt(0).
12527                                                           - Could be split into
12528                                                             separate s_waitcnt
12529                                                             vmcnt(0), s_waitcnt
12530                                                             vscnt(0), and s_waitcnt
12531                                                             lgkmcnt(0) to allow
12532                                                             them to be
12533                                                             independently moved
12534                                                             according to the
12535                                                             following rules.
12536                                                           - s_waitcnt vmcnt(0)
12537                                                             must happen after
12538                                                             any preceding
12539                                                             global/generic
12540                                                             load/load atomic/
12541                                                             atomicrmw-with-return-value.
12542                                                           - s_waitcnt vscnt(0)
12543                                                             must happen after
12544                                                             any preceding
12545                                                             global/generic
12546                                                             store/store atomic/
12547                                                             atomicrmw-no-return-value.
12548                                                           - s_waitcnt lgkmcnt(0)
12549                                                             must happen after
12550                                                             any preceding
12551                                                             local/generic
12552                                                             load/store/load
12553                                                             atomic/store
12554                                                             atomic/atomicrmw.
12555                                                           - Must happen before
12556                                                             the following
12557                                                             atomicrmw.
12558                                                           - Ensures that all
12559                                                             memory operations
12560                                                             have
12561                                                             completed before
12562                                                             performing the
12563                                                             atomicrmw that is
12564                                                             being released.
12565
12566                                                         2. flat_atomic
12567                                                         3. s_waitcnt vm/vscnt(0) &
12568                                                            lgkmcnt(0)
12569
12570                                                           - If OpenCL, omit
12571                                                             lgkmcnt(0).
12572                                                           - Use vmcnt(0) if atomic with
12573                                                             return and vscnt(0) if
12574                                                             atomic with no-return.
12575                                                           - Must happen before
12576                                                             following
12577                                                             buffer_gl*_inv.
12578                                                           - Ensures the
12579                                                             atomicrmw has
12580                                                             completed before
12581                                                             invalidating the
12582                                                             caches.
12583
12584                                                         4. buffer_gl0_inv;
12585                                                            buffer_gl1_inv
12586
12587                                                           - Must happen before
12588                                                             any following
12589                                                             global/generic
12590                                                             load/load
12591                                                             atomic/atomicrmw.
12592                                                           - Ensures that
12593                                                             following loads
12594                                                             will not see stale
12595                                                             global data.
12596
12597     fence        acq_rel      - singlethread *none*     *none*
12598                               - wavefront
12599     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12600                                                            vmcnt(0) & vscnt(0)
12601
12602                                                           - If CU wavefront execution
12603                                                             mode, omit vmcnt(0) and
12604                                                             vscnt(0).
12605                                                           - If OpenCL and
12606                                                             address space is
12607                                                             not generic, omit
12608                                                             lgkmcnt(0).
12609                                                           - If OpenCL and
12610                                                             address space is
12611                                                             local, omit
12612                                                             vmcnt(0) and vscnt(0).
12613                                                           - However,
12614                                                             since LLVM
12615                                                             currently has no
12616                                                             address space on
12617                                                             the fence, the waits
12618                                                             must conservatively
12619                                                             always be generated
12620                                                             (see comment for
12621                                                             previous fence).
12622                                                           - Could be split into
12623                                                             separate s_waitcnt
12624                                                             vmcnt(0), s_waitcnt
12625                                                             vscnt(0) and s_waitcnt
12626                                                             lgkmcnt(0) to allow
12627                                                             them to be
12628                                                             independently moved
12629                                                             according to the
12630                                                             following rules.
12631                                                           - s_waitcnt vmcnt(0)
12632                                                             must happen after
12633                                                             any preceding
12634                                                             global/generic
12635                                                             load/load
12636                                                             atomic/
12637                                                             atomicrmw-with-return-value.
12638                                                           - s_waitcnt vscnt(0)
12639                                                             must happen after
12640                                                             any preceding
12641                                                             global/generic
12642                                                             store/store atomic/
12643                                                             atomicrmw-no-return-value.
12644                                                           - s_waitcnt lgkmcnt(0)
12645                                                             must happen after
12646                                                             any preceding
12647                                                             local/generic
12648                                                             load/store/load
12649                                                             atomic/store atomic/
12650                                                             atomicrmw.
12651                                                           - Must happen before
12652                                                             any following
12653                                                             global/generic
12654                                                             load/load
12655                                                             atomic/store/store
12656                                                             atomic/atomicrmw.
12657                                                           - Ensures that all
12658                                                             memory operations
12659                                                             have
12660                                                             completed before
12661                                                             performing any
12662                                                             following global
12663                                                             memory operations.
12664                                                           - Ensures that the
12665                                                             preceding
12666                                                             local/generic load
12667                                                             atomic/atomicrmw
12668                                                             with an equal or
12669                                                             wider sync scope
12670                                                             and memory ordering
12671                                                             stronger than
12672                                                             unordered (this is
12673                                                             termed the
12674                                                             acquire-fence-paired-atomic)
12675                                                             has completed
12676                                                             before following
12677                                                             global memory
12678                                                             operations. This
12679                                                             satisfies the
12680                                                             requirements of
12681                                                             acquire.
12682                                                           - Ensures that all
12683                                                             previous memory
12684                                                             operations have
12685                                                             completed before a
12686                                                             following
12687                                                             local/generic store
12688                                                             atomic/atomicrmw
12689                                                             with an equal or
12690                                                             wider sync scope
12691                                                             and memory ordering
12692                                                             stronger than
12693                                                             unordered (this is
12694                                                             termed the
12695                                                             release-fence-paired-atomic).
12696                                                             This satisfies the
12697                                                             requirements of
12698                                                             release.
12699                                                           - Must happen before
12700                                                             the following
12701                                                             buffer_gl0_inv.
12702                                                           - Ensures that the
12703                                                             acquire-fence-paired-atomic
12704                                                             has completed
12705                                                             before invalidating
12706                                                             the
12707                                                             cache. Therefore
12708                                                             any following
12709                                                             locations read must
12710                                                             be no older than
12711                                                             the value read by
12712                                                             the
12713                                                             acquire-fence-paired-atomic.
12714
12715                                                         2. buffer_gl0_inv
12716
12717                                                           - If CU wavefront execution
12718                                                             mode, omit.
12719                                                           - Ensures that
12720                                                             following
12721                                                             loads will not see
12722                                                             stale data.
12723
12724     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12725                               - system                     vmcnt(0) & vscnt(0)
12726
12727                                                           - If OpenCL and
12728                                                             address space is
12729                                                             not generic, omit
12730                                                             lgkmcnt(0).
12731                                                           - If OpenCL and
12732                                                             address space is
12733                                                             local, omit
12734                                                             vmcnt(0) and vscnt(0).
12735                                                           - However, since LLVM
12736                                                             currently has no
12737                                                             address space on
12738                                                             the fence, the waits
12739                                                             must conservatively
12740                                                             always be generated
12741                                                             (see comment for
12742                                                             previous fence).
12743                                                           - Could be split into
12744                                                             separate s_waitcnt
12745                                                             vmcnt(0), s_waitcnt
12746                                                             vscnt(0) and s_waitcnt
12747                                                             lgkmcnt(0) to allow
12748                                                             them to be
12749                                                             independently moved
12750                                                             according to the
12751                                                             following rules.
12752                                                           - s_waitcnt vmcnt(0)
12753                                                             must happen after
12754                                                             any preceding
12755                                                             global/generic
12756                                                             load/load
12757                                                             atomic/
12758                                                             atomicrmw-with-return-value.
12759                                                           - s_waitcnt vscnt(0)
12760                                                             must happen after
12761                                                             any preceding
12762                                                             global/generic
12763                                                             store/store atomic/
12764                                                             atomicrmw-no-return-value.
12765                                                           - s_waitcnt lgkmcnt(0)
12766                                                             must happen after
12767                                                             any preceding
12768                                                             local/generic
12769                                                             load/store/load
12770                                                             atomic/store
12771                                                             atomic/atomicrmw.
12772                                                           - Must happen before
12773                                                             the following
12774                                                             buffer_gl*_inv.
12775                                                           - Ensures that the
12776                                                             preceding
12777                                                             global/local/generic
12778                                                             load
12779                                                             atomic/atomicrmw
12780                                                             with an equal or
12781                                                             wider sync scope
12782                                                             and memory ordering
12783                                                             stronger than
12784                                                             unordered (this is
12785                                                             termed the
12786                                                             acquire-fence-paired-atomic)
12787                                                             has completed
12788                                                             before invalidating
12789                                                             the caches. This
12790                                                             satisfies the
12791                                                             requirements of
12792                                                             acquire.
12793                                                           - Ensures that all
12794                                                             previous memory
12795                                                             operations have
12796                                                             completed before a
12797                                                             following
12798                                                             global/local/generic
12799                                                             store
12800                                                             atomic/atomicrmw
12801                                                             with an equal or
12802                                                             wider sync scope
12803                                                             and memory ordering
12804                                                             stronger than
12805                                                             unordered (this is
12806                                                             termed the
12807                                                             release-fence-paired-atomic).
12808                                                             This satisfies the
12809                                                             requirements of
12810                                                             release.
12811
12812                                                         2. buffer_gl0_inv;
12813                                                            buffer_gl1_inv
12814
12815                                                           - Must happen before
12816                                                             any following
12817                                                             global/generic
12818                                                             load/load
12819                                                             atomic/store/store
12820                                                             atomic/atomicrmw.
12821                                                           - Ensures that
12822                                                             following loads
12823                                                             will not see stale
12824                                                             global data. This
12825                                                             satisfies the
12826                                                             requirements of
12827                                                             acquire.
12828
12829     **Sequential Consistent Atomic**
12830     ------------------------------------------------------------------------------------
12831     load atomic  seq_cst      - singlethread - global   *Same as corresponding
12832                               - wavefront    - local    load atomic acquire,
12833                                              - generic  except must generate
12834                                                         all instructions even
12835                                                         for OpenCL.*
12836     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12837                                              - generic     vmcnt(0) & vscnt(0)
12838
12839                                                           - If CU wavefront execution
12840                                                             mode, omit vmcnt(0) and
12841                                                             vscnt(0).
12842                                                           - Could be split into
12843                                                             separate s_waitcnt
12844                                                             vmcnt(0), s_waitcnt
12845                                                             vscnt(0), and s_waitcnt
12846                                                             lgkmcnt(0) to allow
12847                                                             them to be
12848                                                             independently moved
12849                                                             according to the
12850                                                             following rules.
12851                                                           - s_waitcnt lgkmcnt(0) must
12852                                                             happen after
12853                                                             preceding
12854                                                             local/generic load
12855                                                             atomic/store
12856                                                             atomic/atomicrmw
12857                                                             with memory
12858                                                             ordering of seq_cst
12859                                                             and with equal or
12860                                                             wider sync scope.
12861                                                             (Note that seq_cst
12862                                                             fences have their
12863                                                             own s_waitcnt
12864                                                             lgkmcnt(0) and so do
12865                                                             not need to be
12866                                                             considered.)
12867                                                           - s_waitcnt vmcnt(0)
12868                                                             must happen after
12869                                                             preceding
12870                                                             global/generic load
12871                                                             atomic/
12872                                                             atomicrmw-with-return-value
12873                                                             with memory
12874                                                             ordering of seq_cst
12875                                                             and with equal or
12876                                                             wider sync scope.
12877                                                             (Note that seq_cst
12878                                                             fences have their
12879                                                             own s_waitcnt
12880                                                             vmcnt(0) and so do
12881                                                             not need to be
12882                                                             considered.)
12883                                                           - s_waitcnt vscnt(0)
12884                                                             must happen after
12885                                                             preceding
12886                                                             global/generic store
12887                                                             atomic/
12888                                                             atomicrmw-no-return-value
12889                                                             with memory
12890                                                             ordering of seq_cst
12891                                                             and with equal or
12892                                                             wider sync scope.
12893                                                             (Note that seq_cst
12894                                                             fences have their
12895                                                             own s_waitcnt
12896                                                             vscnt(0) and so do
12897                                                             not need to be
12898                                                             considered.)
12899                                                           - Ensures any
12900                                                             preceding
12901                                                             sequentially
12902                                                             consistent global/local
12903                                                             memory instructions
12904                                                             have completed
12905                                                             before executing
12906                                                             this sequentially
12907                                                             consistent
12908                                                             instruction. This
12909                                                             prevents reordering
12910                                                             a seq_cst store
12911                                                             followed by a
12912                                                             seq_cst load. (Note
12913                                                             that seq_cst is
12914                                                             stronger than
12915                                                             acquire/release as
12916                                                             the reordering of
12917                                                             load acquire
12918                                                             followed by a store
12919                                                             release is
12920                                                             prevented by the
12921                                                             s_waitcnt of
12922                                                             the release, but
12923                                                             there is nothing
12924                                                             preventing a store
12925                                                             release followed by
12926                                                             load acquire from
12927                                                             completing out of
12928                                                             order. The s_waitcnt
12929                                                             could be placed after
12930                                                             the seq_cst store or
12931                                                             before the seq_cst
12932                                                             load. We choose the load to
12933                                                             make the s_waitcnt be
12934                                                             as late as possible
12935                                                             so that the store
12936                                                             may have already
12937                                                             completed.)
12938
12939                                                         2. *Following
12940                                                            instructions same as
12941                                                            corresponding load
12942                                                            atomic acquire,
12943                                                            except must generate
12944                                                            all instructions even
12945                                                            for OpenCL.*
12946     load atomic  seq_cst      - workgroup    - local
12947
12948                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
12949
12950                                                           - If CU wavefront execution
12951                                                             mode, omit.
12952                                                           - Could be split into
12953                                                             separate s_waitcnt
12954                                                             vmcnt(0) and s_waitcnt
12955                                                             vscnt(0) to allow
12956                                                             them to be
12957                                                             independently moved
12958                                                             according to the
12959                                                             following rules.
12960                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
12962                                                             preceding
12963                                                             global/generic load
12964                                                             atomic/
12965                                                             atomicrmw-with-return-value
12966                                                             with memory
12967                                                             ordering of seq_cst
12968                                                             and with equal or
12969                                                             wider sync scope.
12970                                                             (Note that seq_cst
12971                                                             fences have their
12972                                                             own s_waitcnt
12973                                                             vmcnt(0) and so do
12974                                                             not need to be
12975                                                             considered.)
12976                                                           - s_waitcnt vscnt(0)
                                                             must happen after
12978                                                             preceding
12979                                                             global/generic store
12980                                                             atomic/
12981                                                             atomicrmw-no-return-value
12982                                                             with memory
12983                                                             ordering of seq_cst
12984                                                             and with equal or
12985                                                             wider sync scope.
12986                                                             (Note that seq_cst
12987                                                             fences have their
12988                                                             own s_waitcnt
12989                                                             vscnt(0) and so do
12990                                                             not need to be
12991                                                             considered.)
12992                                                           - Ensures any
12993                                                             preceding
                                                             sequentially
12995                                                             consistent global
12996                                                             memory instructions
12997                                                             have completed
12998                                                             before executing
12999                                                             this sequentially
13000                                                             consistent
13001                                                             instruction. This
13002                                                             prevents reordering
13003                                                             a seq_cst store
13004                                                             followed by a
13005                                                             seq_cst load. (Note
13006                                                             that seq_cst is
13007                                                             stronger than
13008                                                             acquire/release as
13009                                                             the reordering of
13010                                                             load acquire
13011                                                             followed by a store
13012                                                             release is
13013                                                             prevented by the
13014                                                             s_waitcnt of
13015                                                             the release, but
13016                                                             there is nothing
13017                                                             preventing a store
13018                                                             release followed by
13019                                                             load acquire from
13020                                                             completing out of
13021                                                             order. The s_waitcnt
13022                                                             could be placed after
13023                                                             seq_store or before
13024                                                             the seq_load. We
13025                                                             choose the load to
13026                                                             make the s_waitcnt be
13027                                                             as late as possible
13028                                                             so that the store
13029                                                             may have already
13030                                                             completed.)
13031
13032                                                         2. *Following
13033                                                            instructions same as
13034                                                            corresponding load
13035                                                            atomic acquire,
13036                                                            except must generate
13037                                                            all instructions even
13038                                                            for OpenCL.*
13039
13040     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13041                               - system       - generic     vmcnt(0) & vscnt(0)
13042
13043                                                           - Could be split into
13044                                                             separate s_waitcnt
13045                                                             vmcnt(0), s_waitcnt
13046                                                             vscnt(0) and s_waitcnt
13047                                                             lgkmcnt(0) to allow
13048                                                             them to be
13049                                                             independently moved
13050                                                             according to the
13051                                                             following rules.
13052                                                           - s_waitcnt lgkmcnt(0)
13053                                                             must happen after
13054                                                             preceding
13055                                                             local load
13056                                                             atomic/store
13057                                                             atomic/atomicrmw
13058                                                             with memory
13059                                                             ordering of seq_cst
13060                                                             and with equal or
13061                                                             wider sync scope.
13062                                                             (Note that seq_cst
13063                                                             fences have their
13064                                                             own s_waitcnt
13065                                                             lgkmcnt(0) and so do
13066                                                             not need to be
13067                                                             considered.)
13068                                                           - s_waitcnt vmcnt(0)
13069                                                             must happen after
13070                                                             preceding
13071                                                             global/generic load
13072                                                             atomic/
13073                                                             atomicrmw-with-return-value
13074                                                             with memory
13075                                                             ordering of seq_cst
13076                                                             and with equal or
13077                                                             wider sync scope.
13078                                                             (Note that seq_cst
13079                                                             fences have their
13080                                                             own s_waitcnt
13081                                                             vmcnt(0) and so do
13082                                                             not need to be
13083                                                             considered.)
13084                                                           - s_waitcnt vscnt(0)
                                                             must happen after
13086                                                             preceding
13087                                                             global/generic store
13088                                                             atomic/
13089                                                             atomicrmw-no-return-value
13090                                                             with memory
13091                                                             ordering of seq_cst
13092                                                             and with equal or
13093                                                             wider sync scope.
13094                                                             (Note that seq_cst
13095                                                             fences have their
13096                                                             own s_waitcnt
13097                                                             vscnt(0) and so do
13098                                                             not need to be
13099                                                             considered.)
13100                                                           - Ensures any
13101                                                             preceding
                                                             sequentially
13103                                                             consistent global
13104                                                             memory instructions
13105                                                             have completed
13106                                                             before executing
13107                                                             this sequentially
13108                                                             consistent
13109                                                             instruction. This
13110                                                             prevents reordering
13111                                                             a seq_cst store
13112                                                             followed by a
13113                                                             seq_cst load. (Note
13114                                                             that seq_cst is
13115                                                             stronger than
13116                                                             acquire/release as
13117                                                             the reordering of
13118                                                             load acquire
13119                                                             followed by a store
13120                                                             release is
13121                                                             prevented by the
13122                                                             s_waitcnt of
13123                                                             the release, but
13124                                                             there is nothing
13125                                                             preventing a store
13126                                                             release followed by
13127                                                             load acquire from
13128                                                             completing out of
13129                                                             order. The s_waitcnt
13130                                                             could be placed after
13131                                                             seq_store or before
13132                                                             the seq_load. We
13133                                                             choose the load to
13134                                                             make the s_waitcnt be
13135                                                             as late as possible
13136                                                             so that the store
13137                                                             may have already
13138                                                             completed.)
13139
13140                                                         2. *Following
13141                                                            instructions same as
13142                                                            corresponding load
13143                                                            atomic acquire,
13144                                                            except must generate
13145                                                            all instructions even
13146                                                            for OpenCL.*
13147     store atomic seq_cst      - singlethread - global   *Same as corresponding
13148                               - wavefront    - local    store atomic release,
13149                               - workgroup    - generic  except must generate
13150                               - agent                   all instructions even
13151                               - system                  for OpenCL.*
13152     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
13153                               - wavefront    - local    atomicrmw acq_rel,
13154                               - workgroup    - generic  except must generate
13155                               - agent                   all instructions even
13156                               - system                  for OpenCL.*
13157     fence        seq_cst      - singlethread *none*     *Same as corresponding
13158                               - wavefront               fence acq_rel,
13159                               - workgroup               except must generate
13160                               - agent                   all instructions even
13161                               - system                  for OpenCL.*
13162     ============ ============ ============== ========== ================================
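
The two ``load atomic`` ``seq_cst`` rows shown just above (``workgroup`` scope
with the ``local`` address space, and ``agent``/``system`` scope with the
``global``/``generic`` address spaces) can be summarized by the following
minimal Python sketch. It is illustrative only and is not backend code: the
names ``seq_cst_load_waitcnt`` and ``format_waitcnt`` are hypothetical, rows
that the sketch does not cover raise an error, and the formatted string mirrors
the table's notation rather than exact assembler syntax.

.. code-block:: python

  def seq_cst_load_waitcnt(scope, address_space, cu_mode=False):
      """Return the counters that must reach zero before a seq_cst load atomic.

      Only the two table rows named in the text above are modelled.
      """
      if scope == "workgroup" and address_space == "local":
          # The table says to omit the wait in CU wavefront execution mode.
          return [] if cu_mode else ["vmcnt", "vscnt"]
      if scope in ("agent", "system") and address_space in ("global", "generic"):
          return ["lgkmcnt", "vmcnt", "vscnt"]
      raise ValueError("row not covered by this sketch")


  def format_waitcnt(counters):
      """Render the counters in the table's 's_waitcnt a(0) & b(0)' notation."""
      if not counters:
          return ""
      return "s_waitcnt " + " & ".join(name + "(0)" for name in counters)


  print(format_waitcnt(seq_cst_load_waitcnt("agent", "global")))

The sketch returns a list rather than a single string because the table notes
that the combined wait could be split into separate ``s_waitcnt`` instructions
that are moved independently.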
13163
13164.. _amdgpu-amdhsa-trap-handler-abi:
13165
13166Trap Handler ABI
13167~~~~~~~~~~~~~~~~
13168
13169For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13170runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13171supports the ``s_trap`` instruction. For usage see:
13172
13173- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13174- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13175- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13176
13177  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13178     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13179
13180     =================== =============== =============== =======================================
13181     Usage               Code Sequence   Trap Handler    Description
13182                                         Inputs
13183     =================== =============== =============== =======================================
13184     reserved            ``s_trap 0x00``                 Reserved by hardware.
13185     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
13186                                           ``queue_ptr`` intrinsic (not implemented).
13187                                         ``VGPR0``:
13188                                           ``arg``
13189     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13190                                           ``queue_ptr`` the trap instruction. The associated
13191                                                         queue is signalled to put it into the
13192                                                         error state.  When the queue is put in
13193                                                         the error state, the waves executing
13194                                                         dispatches on the queue will be
13195                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
13200                                                         - If the debugger is enabled, causes
13201                                                           the debug trap to be reported by the
13202                                                           debugger and the wavefront is put in
13203                                                           the halt state with the PC at the
13204                                                           instruction.  The debugger must
13205                                                           increment the PC and resume the wave.
13206     reserved            ``s_trap 0x04``                 Reserved.
13207     reserved            ``s_trap 0x05``                 Reserved.
13208     reserved            ``s_trap 0x06``                 Reserved.
13209     reserved            ``s_trap 0x07``                 Reserved.
13210     reserved            ``s_trap 0x08``                 Reserved.
13211     reserved            ``s_trap 0xfe``                 Reserved.
13212     reserved            ``s_trap 0xff``                 Reserved.
13213     =================== =============== =============== =======================================
13214
13215..
13216
13217  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13218     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13219
13220     =================== =============== =============== =======================================
13221     Usage               Code Sequence   Trap Handler    Description
13222                                         Inputs
13223     =================== =============== =============== =======================================
13224     reserved            ``s_trap 0x00``                 Reserved by hardware.
13225     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
13226                                                         breakpoints. Causes wave to be halted
13227                                                         with the PC at the trap instruction.
                                                         The debugger is responsible for
                                                         resuming the wave, including the
                                                         instruction that the breakpoint
                                                         overwrote.
13231     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13232                                           ``queue_ptr`` the trap instruction. The associated
13233                                                         queue is signalled to put it into the
13234                                                         error state.  When the queue is put in
13235                                                         the error state, the waves executing
13236                                                         dispatches on the queue will be
13237                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
13242                                                         - If the debugger is enabled, causes
13243                                                           the debug trap to be reported by the
13244                                                           debugger and the wavefront is put in
13245                                                           the halt state with the PC at the
13246                                                           instruction.  The debugger must
13247                                                           increment the PC and resume the wave.
13248     reserved            ``s_trap 0x04``                 Reserved.
13249     reserved            ``s_trap 0x05``                 Reserved.
13250     reserved            ``s_trap 0x06``                 Reserved.
13251     reserved            ``s_trap 0x07``                 Reserved.
13252     reserved            ``s_trap 0x08``                 Reserved.
13253     reserved            ``s_trap 0xfe``                 Reserved.
13254     reserved            ``s_trap 0xff``                 Reserved.
13255     =================== =============== =============== =======================================
13256
13257..
13258
13259  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13260     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13261
13262     =================== =============== ================ ================= =======================================
13263     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
13264     =================== =============== ================ ================= =======================================
13265     reserved            ``s_trap 0x00``                                    Reserved by hardware.
13266     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
13267                                                                            breakpoints. Causes wave to be halted
13268                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible for
                                                                            resuming the wave, including the
                                                                            instruction that the breakpoint
                                                                            overwrote.
13272     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
13273                                           ``queue_ptr``                    the trap instruction. The associated
13274                                                                            queue is signalled to put it into the
13275                                                                            error state.  When the queue is put in
13276                                                                            the error state, the waves executing
13277                                                                            dispatches on the queue will be
13278                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled,
                                                                              behaves as a no-operation. The trap
                                                                              handler is entered and immediately
                                                                              returns to continue execution of the
                                                                              wavefront.
13283                                                                            - If the debugger is enabled, causes
13284                                                                              the debug trap to be reported by the
13285                                                                              debugger and the wavefront is put in
13286                                                                              the halt state with the PC at the
13287                                                                              instruction.  The debugger must
13288                                                                              increment the PC and resume the wave.
13289     reserved            ``s_trap 0x04``                                    Reserved.
13290     reserved            ``s_trap 0x05``                                    Reserved.
13291     reserved            ``s_trap 0x06``                                    Reserved.
13292     reserved            ``s_trap 0x07``                                    Reserved.
13293     reserved            ``s_trap 0x08``                                    Reserved.
13294     reserved            ``s_trap 0xfe``                                    Reserved.
13295     reserved            ``s_trap 0xff``                                    Reserved.
13296     =================== =============== ================ ================= =======================================
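
As a reading aid, the code object V4 and above table can be transcribed into a
small lookup. The following Python sketch is only an illustration of the table
contents: the names ``TRAP_USAGE``, ``RESERVED_TRAP_IDS`` and
``trap_handler_inputs`` are hypothetical and do not correspond to any runtime
or backend API, and trap IDs that the table does not list are rejected rather
than guessed.

.. code-block:: python

  # Usage of each non-reserved trap ID, transcribed from the table.
  TRAP_USAGE = {
      0x01: "debugger breakpoint",
      0x02: "llvm.trap",
      0x03: "llvm.debugtrap",
  }

  # Trap IDs that the table lists as reserved.
  RESERVED_TRAP_IDS = {0x00, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFE, 0xFF}


  def trap_handler_inputs(trap_id, gfx_major):
      """Return (usage, inputs) for a trap ID on a GFX6-GFX10 target."""
      if not 6 <= gfx_major <= 10:
          raise ValueError("the table only covers GFX6-GFX10")
      if trap_id in RESERVED_TRAP_IDS:
          return ("reserved", {})
      usage = TRAP_USAGE[trap_id]  # KeyError for IDs the table does not list
      if usage == "llvm.trap" and gfx_major <= 8:
          # Only GFX6-GFX8 pass the queue pointer to the trap handler.
          return (usage, {"SGPR0-1": "queue_ptr"})
      return (usage, {})


  print(trap_handler_inputs(0x02, 8))
  print(trap_handler_inputs(0x02, 9))

The only row whose inputs depend on the target is ``llvm.trap``: it passes
``queue_ptr`` in ``SGPR0-1`` on GFX6-GFX8 and has no inputs on GFX9-GFX10.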
13297
13298.. _amdgpu-amdhsa-function-call-convention:
13299
13300Call Convention
13301~~~~~~~~~~~~~~~
13302
13303.. note::
13304
  This section is currently incomplete and contains inaccuracies. It is a work
  in progress and will be updated as more information becomes available.
13307
13308See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13309addresses. Unswizzled addresses are normal linear addresses.
13310
13311.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
13312
13313Kernel Functions
13314++++++++++++++++
13315
13316This section describes the call convention ABI for the outer kernel function.
13317
13318See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
13319convention.
13320
The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU backend implements function calls:
13323
133241.  Clang decides the kernarg layout to match the *HSA Programmer's Language
13325    Reference* [HSA]_.
13326
13327    - All structs are passed directly.
13328    - Lambda values are passed *TBA*.
13329
13330    .. TODO::
13331
13332      - Does this really follow HSA rules? Or are structs >16 bytes passed
13333        by-value struct?
13334      - What is ABI for lambda values?
13335
2.  The kernel performs certain setup in its prolog, as described in
13337    :ref:`amdgpu-amdhsa-kernel-prolog`.
13338
13339.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13340
13341Non-Kernel Functions
13342++++++++++++++++++++
13343
13344This section describes the call convention ABI for functions other than the
13345outer kernel function.
13346
If a kernel has function calls, scratch is always allocated and used for the
call stack, which grows from low addresses to high addresses using the swizzled
scratch address space.
13350
13351On entry to a function:
13352
133531.  SGPR0-3 contain a V# with the following properties (see
13354    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
13355
13356    * Base address pointing to the beginning of the wavefront scratch backing
13357      memory.
13358    * Swizzled with dword element size and stride of wavefront size elements.
13359
2.  The FLAT_SCRATCH register pair is set up. See
    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3.  GFX6-GFX8: The M0 register is set to the size of LDS in bytes. See
    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
133644.  The EXEC register is set to the lanes active on entry to the function.
133655.  MODE register: *TBD*
133666.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
13367    below.
7.  SGPR30-31 contain the return address (RA): the code address to which the
    function must return when it completes. The value is undefined if the
    function is *no return*.
133718.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13372    offset relative to the beginning of the wavefront scratch backing memory.
13373
13374    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
13375    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
13376    manner.
13377
13378    The unswizzled SP value can be converted into the swizzled SP value by:
13379
13380      | swizzled SP = unswizzled SP / wavefront size
13381
13382    This may be used to obtain the private address space address of stack
13383    objects and to convert this address to a flat address by adding the flat
13384    scratch aperture base address.
13385
    The swizzled SP value is always 4-byte aligned for the ``r600``
    architecture and 16-byte aligned for the ``amdgcn`` architecture.
13388
13389    .. note::
13390
13391      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13392      OpenCL language which has the largest base type defined as 16 bytes.
13393
13394    On entry, the swizzled SP value is the address of the first function
13395    argument passed on the stack. Other stack passed arguments are positive
13396    offsets from the entry swizzled SP value.
13397
13398    The function may use positive offsets beyond the last stack passed argument
13399    for stack allocated local variables and register spill slots. If necessary,
13400    the function may align these to greater alignment than 16 bytes. After these
13401    the function may dynamically allocate space for such things as runtime sized
13402    ``alloca`` local allocations.
13403
13404    If the function calls another function, it will place any stack allocated
13405    arguments after the last local allocation and adjust SGPR32 to the address
13406    after the last local allocation.
13407
134089.  All other registers are unspecified.
1340910. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
13410    to the function.
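The SP conversion described in item 8 can be sketched as follows. This is an
illustrative model only, not backend code: the function names and the
``flat_scratch_aperture_base`` parameter are hypothetical, the example values
are made up, and integer division is assumed.

```python
def swizzled_sp(unswizzled_sp: int, wavefront_size: int) -> int:
    """Convert an unswizzled scratch offset into the swizzled SP value."""
    if wavefront_size <= 0:
        raise ValueError("wavefront size must be positive")
    # swizzled SP = unswizzled SP / wavefront size
    return unswizzled_sp // wavefront_size


def is_amdgcn_aligned(swizzled: int) -> bool:
    """Check the 16 byte alignment stated for the amdgcn architecture."""
    return swizzled % 16 == 0


def flat_address(swizzled: int, flat_scratch_aperture_base: int) -> int:
    """Form a flat address by adding the (hypothetical) aperture base value."""
    return flat_scratch_aperture_base + swizzled


# Made-up example: a wavefront size of 64 and an unswizzled SP of 4096.
sp = swizzled_sp(4096, 64)
print(sp, is_amdgcn_aligned(sp))
```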
13411
13412On exit from a function:
13413
134141.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13415    described below. Any registers used are considered clobbered registers.
134162.  The following registers are preserved and have the same value as on entry:
13417
13418    * FLAT_SCRATCH
13419    * EXEC
13420    * GFX6-GFX8: M0
13421    * All SGPR registers except the clobbered registers of SGPR4-31.
13422    * VGPR40-47
13423    * VGPR56-63
13424    * VGPR72-79
13425    * VGPR88-95
13426    * VGPR104-111
13427    * VGPR120-127
13428    * VGPR136-143
13429    * VGPR152-159
13430    * VGPR168-175
13431    * VGPR184-191
13432    * VGPR200-207
13433    * VGPR216-223
13434    * VGPR232-239
13435    * VGPR248-255
13436
13437        .. note::
13438
13439          Except the argument registers, the VGPRs clobbered and the preserved
13440          registers are intermixed at regular intervals in order to keep a
13441          similar ratio independent of the number of allocated VGPRs.
13442
13443    * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13444    * Lanes of all VGPRs that are inactive at the call site.
13445
13446      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
      optimization may mark some of the clobbered SGPR and VGPR registers as
13448      preserved if it can be determined that the called function does not change
13449      their value.
13450
3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
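The preserved VGPR ranges listed above follow a regular pattern: starting at
VGPR40, blocks of 8 preserved registers alternate with blocks of 8 clobbered
registers. A minimal sketch of that pattern, using a hypothetical helper name
and ignoring both the IPRA optimization and the inactive-lane rule:

```python
def is_preserved_vgpr(n: int) -> bool:
    """Return True if VGPR n falls in one of the preserved ranges listed above."""
    if not 0 <= n <= 255:
        raise ValueError("VGPR index out of range")
    # Preserved blocks start at VGPR40 and repeat every 16 registers:
    # 8 preserved registers followed by 8 clobbered registers.
    return n >= 40 and (n - 40) % 16 < 8


# Rebuild the (first, last) ranges from the predicate.
preserved_ranges = [
    (n, n + 7)
    for n in range(0, 256, 8)
    if is_preserved_vgpr(n)
]
```

With this predicate, ``preserved_ranges`` starts at ``(40, 47)`` and ends at
``(248, 255)``, matching the list.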
13456
13457.. TODO::
13458
13459  - How are function results returned? The address of structured types is passed
13460    by reference, but what about other types?
13461
13462The function input arguments are made up of the formal arguments explicitly
13463declared by the source language function plus the implicit input arguments used
13464by the implementation.
13465
13466The source language input arguments are:
13467
134681. Any source language implicit ``this`` or ``self`` argument comes first as a
13469   pointer type.
134702. Followed by the function formal arguments in left to right source order.
13471
13472The source language result arguments are:
13473
134741. The function result argument.
13475
Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if
the called function requires the struct to be in memory, for example because
its address is taken, then the function body is responsible for allocating a
stack location and copying the field arguments into it. Clang terms this
*direct struct*.
13483
Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location,
making a copy of the struct value in it, and passing the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.
13489
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
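The three struct cases above reduce to a size check plus the argument's
direction. The sketch below only restates the rules as written here; the enum
and function names are hypothetical, and it does not settle the open questions
about ``sret`` raised in this section.

```python
from enum import Enum


class StructPassing(Enum):
    DIRECT = "direct struct"             # decomposed into base type fields
    BY_VALUE = "by-value struct"         # caller copies to stack, passes address
    SRET = "structured return (sret)"    # caller allocates result, passes address


def classify_struct(size_in_bytes: int, is_result: bool) -> StructPassing:
    """Classify a struct argument using the 16 byte threshold described above."""
    if size_in_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_in_bytes <= 16:
        # Applies to both input and result struct arguments.
        return StructPassing.DIRECT
    return StructPassing.SRET if is_result else StructPassing.BY_VALUE
```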
13496
13497*TODO: correct the ``sret`` definition.*
13498
13499.. TODO::
13500
13501  Is this definition correct? Or is ``sret`` only used if passing in registers, and
13502  pass as non-decomposed struct as stack argument? Or something else? Is the
13503  memory location in the caller stack frame, or a stack memory argument and so
13504  no address is passed as the caller can directly write to the argument stack
13505  location? But then the stack location is still live after return. If an
13506  argument stack location is it the first stack argument or the last one?
13507
13508Lambda argument types are treated as struct types with an implementation defined
13509set of fields.
13510
13511.. TODO::
13512
13513  Need to specify the ABI for lambda types for AMDGPU.
13514
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
13518
13519The AMDGPU backend walks the function call graph from the leaves to determine
13520which implicit input arguments are used, propagating to each caller of the
13521function. The used implicit arguments are appended to the function arguments
13522after the source language arguments in the following order:
13523
13524.. TODO::
13525
13526  Is recursion or external functions supported?
13527
135281.  Work-Item ID (1 VGPR)
13529
    The X, Y and Z work-item IDs are packed into a single VGPR with the following
13531    layout. Only fields actually used by the function are set. The other bits
13532    are undefined.
13533
13534    The values come from the initial kernel execution state. See
13535    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13536
13537    .. table:: Work-item implicit argument layout
13538      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
13539
13540      ======= ======= ==============
13541      Bits    Size    Field Name
13542      ======= ======= ==============
13543      9:0     10 bits X Work-Item ID
13544      19:10   10 bits Y Work-Item ID
13545      29:20   10 bits Z Work-Item ID
13546      31:30   2 bits  Unused
13547      ======= ======= ==============
13548
135492.  Dispatch Ptr (2 SGPRs)
13550
13551    The value comes from the initial kernel execution state. See
13552    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13553
135543.  Queue Ptr (2 SGPRs)
13555
13556    The value comes from the initial kernel execution state. See
13557    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13558
135594.  Kernarg Segment Ptr (2 SGPRs)
13560
13561    The value comes from the initial kernel execution state. See
13562    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13563
5.  Dispatch ID (2 SGPRs)
13565
13566    The value comes from the initial kernel execution state. See
13567    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13568
135696.  Work-Group ID X (1 SGPR)
13570
13571    The value comes from the initial kernel execution state. See
13572    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13573
135747.  Work-Group ID Y (1 SGPR)
13575
13576    The value comes from the initial kernel execution state. See
13577    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13578
135798.  Work-Group ID Z (1 SGPR)
13580
13581    The value comes from the initial kernel execution state. See
13582    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13583
135849.  Implicit Argument Ptr (2 SGPRs)
13585
13586    The value is computed by adding an offset to Kernarg Segment Ptr to get the
13587    global address space pointer to the first kernarg implicit argument.
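The work-item ID layout in item 1 can be illustrated with a small pack and
unpack sketch. The helper names are hypothetical. Because unused fields and
bits 31:30 are undefined, the unpack helper masks each field instead of
assuming the remaining bits are zero.

```python
FIELD_BITS = 10
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0x3FF, one 10-bit field


def pack_workitem_id(x: int, y: int, z: int) -> int:
    """Pack X, Y and Z work-item IDs into bits 9:0, 19:10 and 29:20."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << 10) | (z << 20)


def unpack_workitem_id(packed: int) -> tuple:
    """Extract (X, Y, Z), ignoring the unused bits 31:30."""
    return (
        packed & FIELD_MASK,
        (packed >> 10) & FIELD_MASK,
        (packed >> 20) & FIELD_MASK,
    )
```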
13588
13589The input and result arguments are assigned in order in the following manner:
13590
13591.. note::
13592
13593  There are likely some errors and omissions in the following description that
13594  need correction.
13595
13596  .. TODO::
13597
13598    Check the Clang source code to decipher how function arguments and return
13599    results are handled. Also see the AMDGPU specific values used.
13600
13601* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
13602  VGPR31.
13603
13604  If there are more arguments than will fit in these registers, the remaining
13605  arguments are allocated on the stack in order on naturally aligned
13606  addresses.
13607
13608  .. TODO::
13609
13610    How are overly aligned structures allocated on the stack?
13611
13612* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
13613  SGPR29.
13614
13615  If there are more arguments than will fit in these registers, the remaining
13616  arguments are allocated on the stack in order on naturally aligned
13617  addresses.
13618
13619Note that decomposed struct type arguments may have some fields passed in
13620registers and some in memory.
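As a rough model of the VGPR assignment rule above, the sketch below places
arguments in VGPR0-31 and spills the rest to naturally aligned stack offsets.
It rests on several assumptions that this document does not confirm: each
argument is a base type whose size in bytes is a power of two, it occupies
whole 32-bit registers, its natural alignment equals its size, and once one
argument goes to the stack all later ones do too. The function name is
hypothetical.

```python
NUM_ARG_VGPRS = 32  # VGPR0-31 are available for arguments


def assign_vgpr_arguments(sizes_in_bytes):
    """Return ("vgpr", first_register) or ("stack", offset) per argument."""
    locations = []
    next_reg = 0
    stack_offset = 0  # offset from the entry swizzled SP value
    on_stack = False
    for size in sizes_in_bytes:
        if size <= 0:
            raise ValueError("argument size must be positive")
        regs_needed = (size + 3) // 4  # whole 32-bit registers (assumption)
        if not on_stack and next_reg + regs_needed <= NUM_ARG_VGPRS:
            locations.append(("vgpr", next_reg))
            next_reg += regs_needed
            continue
        on_stack = True  # assumption: later arguments also use the stack
        align = size     # assumption: natural alignment equals the size
        stack_offset = (stack_offset + align - 1) // align * align
        locations.append(("stack", stack_offset))
        stack_offset += size
    return locations
```

For example, 32 dword arguments fill VGPR0-31, and a following dword and
64-bit argument land at stack offsets 0 and 8 under these assumptions.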
13621
13622.. TODO::
13623
13624  So, a struct which can pass some fields as decomposed register arguments, will
13625  pass the rest as decomposed stack elements? But an argument that will not start
13626  in registers will not be decomposed and will be passed as a non-decomposed
13627  stack value?
13628
13629The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU backend implements function calls:
13631
136321.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
13633    unswizzled scratch address. It is only needed if runtime sized ``alloca``
13634    are used, or for the reasons defined in ``SIFrameLowering``.
136352.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
13636    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.
13638
3.  Allocating SGPR arguments on the stack is not supported.
13640
136414.  No CFI is currently generated. See
13642    :ref:`amdgpu-dwarf-call-frame-information`.
13643
13644    .. note::
13645
13646      CFI will be generated that defines the CFA as the unswizzled address
13647      relative to the wave scratch base in the unswizzled private address space
13648      of the lowest address stack allocated local variable.
13649
13650      ``DW_AT_frame_base`` will be defined as the swizzled address in the
13651      swizzled private address space by dividing the CFA by the wavefront size
13652      (since CFA is always at least dword aligned which matches the scratch
13653      swizzle element size).
13654
13655      If no dynamic stack alignment was performed, the stack allocated arguments
13656      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
13657      local variables and register spill slots are accessed as positive offsets
13658      relative to ``DW_AT_frame_base``.
13659
136605.  Function argument passing is implemented by copying the input physical
13661    registers to virtual registers on entry. The register allocator can spill if
13662    necessary. These are copied back to physical registers at call sites. The
13663    net effect is that each function call can have these values in entirely
13664    distinct locations. The IPRA can help avoid shuffling argument registers.
136656.  Call sites are implemented by setting up the arguments at positive offsets
13666    from SP. Then SP is incremented to account for the known frame size before
13667    the call and decremented after the call.
13668
13669    .. note::
13670
13671      The CFI will reflect the changed calculation needed to compute the CFA
13672      from SP.
13673
7.  4 byte spill slots are used in the stack frame. One slot is allocated for an
    emergency spill slot. Buffer instructions, rather than ``flat_scratch``
    instructions, are used for stack accesses.
13677
13678    .. TODO::
13679
13680      Explain when the emergency spill slot is used.
13681
13682.. TODO::
13683
13684  Possible broken issues:
13685
13686  - Stack arguments must be aligned to required alignment.
13687  - Stack is aligned to max(16, max formal argument alignment)
13688  - Direct argument < 64 bits should check register budget.
13689  - Register budget calculation should respect ``inreg`` for SGPR.
13690  - SGPR overflow is not handled.
13691  - struct with 1 member unpeeling is not checking size of member.
13692  - ``sret`` is after ``this`` pointer.
13693  - Caller is not implementing stack realignment: need an extra pointer.
13694  - Should say AMDGPU passes FP rather than SP.
13695  - Should CFI define CFA as address of locals or arguments. Difference is
13696    apparent when have implemented dynamic alignment.
13697  - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
13698    highest address of stack frame and use negative offset for locals. Would
13699    allow SP to be the same as FP and could support signal-handler-like as now
13700    have a real SP for the top of the stack.
13701  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
13702    arguments?
13703
13704AMDPAL
13705------
13706
13707This section provides code conventions used when the target triple OS is
13708``amdpal`` (see :ref:`amdgpu-target-triples`).
13709
13710.. _amdgpu-amdpal-code-object-metadata-section:
13711
13712Code Object Metadata
13713~~~~~~~~~~~~~~~~~~~~
13714
13715.. note::
13716
13717  The metadata is currently in development and is subject to major
13718  changes. Only the current version is supported. *When this document
13719  was generated the version was 2.6.*
13720
13721Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
13722record (see :ref:`amdgpu-note-records-v3-onwards`).
13723
13724The metadata is represented as Message Pack formatted binary data (see
13725[MsgPack]_). The top level is a Message Pack map that includes the keys
13726defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
13727and referenced tables.
13728
13729Additional information can be added to the maps. To avoid conflicts, any
13730key names should be prefixed by "*vendor-name*." where ``vendor-name``
13731can be the name of the vendor and specific vendor tool that generates the
13732information. The prefix is abbreviated to simply "." when it appears
13733within a map that has been added by the same *vendor-name*.
13734
13735  .. table:: AMDPAL Code Object Metadata Map
13736     :name: amdgpu-amdpal-code-object-metadata-map-table
13737
13738     =================== ============== ========= ======================================================================
13739     String Key          Value Type     Required? Description
13740     =================== ============== ========= ======================================================================
13741     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
13742                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
13743     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
13744                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
13745                                                  definition of the keys included in that map.
13746     =================== ============== ========= ======================================================================
13747
13748..
13749
13750  .. table:: AMDPAL Code Object Pipeline Metadata Map
13751     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
13752
13753     ====================================== ============== ========= ===================================================
13754     String Key                             Value Type     Required? Description
13755     ====================================== ============== ========= ===================================================
13756     ".name"                                string                   Source name of the pipeline.
13757     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
13758
13759                                                                       - "VsPs"
13760                                                                       - "Gs"
13761                                                                       - "Cs"
13762                                                                       - "Ngg"
13763                                                                       - "Tess"
13764                                                                       - "GsTess"
13765                                                                       - "NggTess"
13766
13767     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
13768                                            2 integers               64 bits is the "stable" portion of the hash, used
13769                                                                     for e.g. shader replacement lookup. Upper 64 bits
13770                                                                     is the "unique" portion of the hash, used for
13771                                                                     e.g. pipeline cache lookup. The value is
13772                                                                     implementation defined, and can not be relied on
13773                                                                     between different builds of the compiler.
13774     ".shaders"                             map                      Per-API shader metadata. See
13775                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
13776                                                                     for the definition of the keys included in that
13777                                                                     map.
13778     ".hardware_stages"                     map                      Per-hardware stage metadata. See
13779                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
13780                                                                     for the definition of the keys included in that
13781                                                                     map.
13782     ".shader_functions"                    map                      Per-shader function metadata. See
13783                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
13784                                                                     for the definition of the keys included in that
13785                                                                     map.
13786     ".registers"                           map            Required  Hardware register configuration. See
13787                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
13788                                                                     for the definition of the keys included in that
13789                                                                     map.
13790     ".user_data_limit"                     integer                  Number of user data entries accessed by this
13791                                                                     pipeline.
13792     ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
13793                                                                     NoUserDataSpilling.
13794     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
13795                                                                     viewport array index feature. Pipelines which use
13796                                                                     this feature can render into all 16 viewports,
13797                                                                     whereas pipelines which do not use it are
13798                                                                     restricted to viewport #0.
13799     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
13800                                                                     handling data-passing between the ES and GS
13801                                                                     shader stages. This can be zero if the data is
13802                                                                     passed using off-chip buffers. This value should
13803                                                                     be used to program all user-SGPRs which have been
13804                                                                     marked with "UserDataMapping::EsGsLdsSize"
13805                                                                     (typically only the GS and VS HW stages will ever
13806                                                                     have a user-SGPR so marked).
13807     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
13808                                                                     (maximum number of threads in a subgroup).
13809     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
13810     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
13811     ".api"                                 string                   Name of the client graphics API.
13812     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
13813                                                                     be defined by the driver using the compiler if
13814                                                                     they want to be able to correlate API-specific
13815                                                                     information used during creation at a later time.
13816     ====================================== ============== ========= ===================================================
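The keys marked *Required* in the two tables above can be checked on metadata
that has already been decoded from Message Pack into native maps and
sequences. The sketch below is a minimal illustration, not part of any PAL or
LLVM tool: the function name, the pipeline name, and the hash values are made
up, and the Message Pack decoding step is deliberately left out.

```python
REQUIRED_TOP_LEVEL_KEYS = ("amdpal.version", "amdpal.pipelines")
REQUIRED_PIPELINE_KEYS = (".internal_pipeline_hash", ".registers")


def missing_required_keys(metadata: dict) -> list:
    """List the required keys that are absent from decoded metadata."""
    missing = [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in metadata]
    for index, pipeline in enumerate(metadata.get("amdpal.pipelines", [])):
        for key in REQUIRED_PIPELINE_KEYS:
            if key not in pipeline:
                missing.append(f"amdpal.pipelines[{index}]{key}")
    return missing


# Made-up example of a decoded top-level map with a single compute pipeline.
example = {
    "amdpal.version": [2, 6],
    "amdpal.pipelines": [
        {
            ".name": "example_pipeline",                   # hypothetical name
            ".type": "Cs",
            ".internal_pipeline_hash": [0x1234, 0x5678],   # made-up values
            ".registers": {},                              # register map omitted
        }
    ],
}
```

Calling ``missing_required_keys(example)`` returns an empty list, while an
empty map reports both top-level keys as missing.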
13817
13818..
13819
13820  .. table:: AMDPAL Code Object Shader Map
13821     :name: amdgpu-amdpal-code-object-shader-map-table
13822
13823
13824     +-------------+--------------+-------------------------------------------------------------------+
13825     |String Key   |Value Type    |Description                                                        |
13826     +=============+==============+===================================================================+
13827     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
13828     |- ".vertex"  |              |for the definition of the keys included in that map.               |
13829     |- ".hull"    |              |                                                                   |
13830     |- ".domain"  |              |                                                                   |
13831     |- ".geometry"|              |                                                                   |
13832     |- ".pixel"   |              |                                                                   |
13833     +-------------+--------------+-------------------------------------------------------------------+
13834
13835..
13836
13837  .. table:: AMDPAL Code Object API Shader Metadata Map
13838     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
13839
13840     ==================== ============== ========= =====================================================================
13841     String Key           Value Type     Required? Description
13842     ==================== ============== ========= =====================================================================
13843     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
13844                          2 integers               is implementation defined, and can not be relied on between
13845                                                   different builds of the compiler.
13846     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
13847                          string                   include:
13848
13849                                                     - ".ls"
13850                                                     - ".hs"
13851                                                     - ".es"
13852                                                     - ".gs"
13853                                                     - ".vs"
13854                                                     - ".ps"
13855                                                     - ".cs"
13856
13857     ==================== ============== ========= =====================================================================
13858
13859..
13860
13861  .. table:: AMDPAL Code Object Hardware Stage Map
13862     :name: amdgpu-amdpal-code-object-hardware-stage-map-table
13863
13864     +-------------+--------------+-----------------------------------------------------------------------+
13865     |String Key   |Value Type    |Description                                                            |
13866     +=============+==============+=======================================================================+
13867     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
13868     |- ".hs"      |              |for the definition of the keys included in that map.                   |
13869     |- ".es"      |              |                                                                       |
13870     |- ".gs"      |              |                                                                       |
13871     |- ".vs"      |              |                                                                       |
13872     |- ".ps"      |              |                                                                       |
13873     |- ".cs"      |              |                                                                       |
13874     +-------------+--------------+-----------------------------------------------------------------------+
13875
13876..
13877
13878  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
13879     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
13880
13881     ========================== ============== ========= ===============================================================
13882     String Key                 Value Type     Required? Description
13883     ========================== ============== ========= ===============================================================
13884     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
13885     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
13886     ".lds_size"                integer                  Local Data Share size in bytes.
13887     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
13888     ".vgpr_count"              integer                  Number of VGPRs used.
13889     ".agpr_count"              integer                  Number of AGPRs used.
13890     ".sgpr_count"              integer                  Number of SGPRs used.
13891     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
13892                                                         directive to instruct the compiler to limit the VGPR usage to
13893                                                         be less than or equal to the specified value (only set if
13894                                                         different from HW default).
13895     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
13896                                                         default).
13897     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
13898                                3 integers
13899     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
13900     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
13901     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
13902     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
13903     ".writes_depth"            boolean                  The shader writes out a depth value.
13904     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
13905                                                         memory or GDS.
13906     ".uses_prim_id"            boolean                  The shader uses PrimID.
13907     ========================== ============== ========= ===============================================================
13908
13909..
13910
13911  .. table:: AMDPAL Code Object Shader Function Map
13912     :name: amdgpu-amdpal-code-object-shader-function-map-table
13913
13914     =============== ============== ====================================================================
13915     String Key      Value Type     Description
13916     =============== ============== ====================================================================
13917     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
13918                                    entry address. The value is the function's metadata. See
13919                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
13920     =============== ============== ====================================================================
13921
13922..
13923
13924  .. table:: AMDPAL Code Object Shader Function Metadata Map
13925     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
13926
13927     ============================= ============== =================================================================
13928     String Key                    Value Type     Description
13929     ============================= ============== =================================================================
13930     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
13931                                   2 integers     is implementation defined, and can not be relied on between
13932                                                  different builds of the compiler.
13933     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
13934     ".lds_size"                   integer        Size in bytes of LDS memory.
13935     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
13936     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
13937     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
13938     ".shader_subtype"             string         Shader subtype/kind. Values include:
13939
13940                                                    - "Unknown"
13941
13942     ============================= ============== =================================================================
13943
13944..
13945
13946  .. table:: AMDPAL Code Object Register Map
13947     :name: amdgpu-amdpal-code-object-register-map-table
13948
13949     ========================== ============== ====================================================================
13950     32-bit Integer Key         Value Type     Description
13951     ========================== ============== ====================================================================
13952     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
13953                                               a GRBM register (i.e., driver accessible GPU register number, not
13954                                               shader GPR register number). The driver is required to program each
13955                                               specified register to the corresponding specified value when
13956                                               executing this pipeline. Typically, the ``reg offsets`` are the
13957                                               ``uint16_t`` offsets to each register as defined by the hardware
13958                                               chip headers. The register is set to the provided value. However, a
13959                                               ``reg offset`` that specifies a user data register (e.g.,
13960                                               COMPUTE_USER_DATA_0) needs special treatment. See
13961                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
13962                                               information.
13963     ========================== ============== ====================================================================
13964
13965.. _amdgpu-amdpal-code-object-user-data-section:
13966
13967User Data
13968+++++++++
13969
13970Each hardware stage has a set of 32-bit physical SPI *user data registers*
13971(either 16 or 32 based on graphics IP and the stage) which can be
13972written from a command buffer and then loaded into SGPRs when waves are
13973launched via a subsequent dispatch or draw operation. This is the way
13974most arguments are passed from the application/runtime to a hardware
13975shader.
13976
PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline that a client can use to pass arguments from a command
buffer to one or more shaders in that pipeline. The ELF code object must
specify a mapping from virtualized *user data entries* to physical *user
data registers*, and PAL is responsible for implementing that mapping,
including spilling overflow *user data entries* to memory if needed.
13983
13984Since the *user data registers* are GRBM-accessible SPI registers, this
13985mapping is actually embedded in the ``.registers`` metadata entry. For
13986most registers, the value in that map is a literal 32-bit value that
13987should be written to the register by the driver. However, when the
13988register is a *user data register* (any USER_DATA register e.g.,
13989SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
13990the driver to write either a *user data entry* value or one of several
13991driver-internal values to the register. This encoding is described in
13992the following table:
13993
13994.. note::
13995
13996  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
13997  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
13998  always be programmed to the address of the GlobalTable, and *user data
13999  register* 1 must always be programmed to the address of the PerShaderTable.
14000
14001..
14002
14003  .. table:: AMDPAL User Data Mapping
14004     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14005
14006     ==========  =================  ===============================================================================
14007     Value       Name               Description
14008     ==========  =================  ===============================================================================
14009     0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14010     0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
14011                                    always point to *user data register* 0).
14012     0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
14013                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14014                                    for more detail (should always point to *user data register* 1).
14015     0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
14016                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14017                                    more detail.
14018     0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14019                                    reference the draw index in the vertex shader. Only supported by the first
14020                                    stage in a graphics pipeline.
14021     0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
14022                                    a graphics pipeline.
14023     0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
14024                                    graphics pipeline.
14025     0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14026                                    a buffer containing the grid dimensions for a Compute dispatch operation. The
14027                                    high half of the address is stored in the next sequential user-SGPR. Only
14028                                    supported by compute pipelines.
14029     0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
14030                                    space used for the ES/GS pseudo-ring-buffer for passing data between shader
14031                                    stages.
14032     0x1000000B  ViewId             View id (32-bit unsigned integer) identifies a view of graphic
14033                                    pipeline instancing.
14034     0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
14035                                    can only appear for one shader stage per pipeline.
14036     0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
14037     0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
14038                                    only appear for one shader stage per pipeline.
14039     0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
14040                                    only appear for one shader stage per pipeline (PS). These replace color targets
14041                                    and are completely separate from any UAVs used by the shader. This is optional,
14042                                    and only used by the PS when UAV exports are used to replace color-target
14043                                    exports to optimize specific shaders.
14044     0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
14045                                    some NGG pipelines to perform culling.  This value contains the address of the
14046                                    first of two consecutive registers which provide the full GPU address.
14047     0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
14048     ==========  =================  ===============================================================================
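
The encoding in the table above can be illustrated with a short Python sketch.
This is not PAL's implementation: the function name
``decode_user_data_value`` and the returned strings are hypothetical, and only
the numeric values and names are taken from the table.

.. code-block:: python

  # Driver-internal values from the AMDPAL User Data Mapping table.
  INTERNAL_VALUES = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }

  MAX_USER_DATA_ENTRIES = 128  # PAL exposes 128 user data entries per pipeline.


  def decode_user_data_value(value):
      """Classify the value stored for a user data register in ``.registers``."""
      if 0 <= value < MAX_USER_DATA_ENTRIES:
          # Values 0..127 select user_data_entry[N].
          return f"user_data_entry[{value}]"
      if value in INTERNAL_VALUES:
          return INTERNAL_VALUES[value]
      raise ValueError(f"unrecognized user data mapping value: {value:#x}")

For example, a value of ``5`` selects *user data entry* 5, while
``0x10000002`` asks the driver to write the spill table pointer.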
14049
14050.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14051
14052Per-Shader Table
14053################
14054
The PerShaderTable value is the low 32 bits of the GPU address of an optional
buffer in the ``.data`` section of the ELF. The high 32 bits of the address
match the high 32 bits of the shader's program counter.
14058
The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.
14064
14065Each shader's table in the ``.data`` section is referenced by the symbol
14066``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``  where *xs* corresponds with the
14067hardware shader stage the data is for. E.g.,
14068``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
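
As an illustration of the address rule and the symbol naming described in this
section, the following Python sketch rebuilds the full 64-bit table address and
the symbol name. The function names are hypothetical and the sketch only
restates the text; it does not describe how any particular compiler or driver
is implemented.

.. code-block:: python

  def per_shader_table_address(shader_pc, table_low32):
      """Combine the high 32 bits of the shader's program counter with the
      low 32 bits supplied through the PerShaderTable user data register."""
      high = shader_pc & 0xFFFFFFFF00000000
      low = table_low32 & 0xFFFFFFFF
      return high | low


  def per_shader_table_symbol(stage):
      """Build the ``.data`` symbol name for a hardware stage such as "cs"."""
      return f"_amdgpu_{stage}_shdr_intrl_data"

With a program counter of ``0x00007F1200001000`` and a register value of
``0x00402000``, the sketch yields the address ``0x00007F1200402000``.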
14069
14070.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14071
14072Spill Table
14073###########
14074
It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory, with
one user data register used to point to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address. The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.
14085
14086Unspecified OS
14087--------------
14088
14089This section provides code conventions used when the target triple OS is
14090empty (see :ref:`amdgpu-target-triples`).
14091
14092Trap Handler ABI
14093~~~~~~~~~~~~~~~~
14094
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
14098
14099  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14100     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14101
14102     =============== =============== ===========================================
14103     Usage           Code Sequence   Description
14104     =============== =============== ===========================================
14105     llvm.trap       s_endpgm        Causes wavefront to be terminated.
14106     llvm.debugtrap  *none*          Compiler warning given that there is no
14107                                     trap handler installed.
14108     =============== =============== ===========================================
14109
14110Source Languages
14111================
14112
14113.. _amdgpu-opencl:
14114
14115OpenCL
14116------
14117
When the language is OpenCL, the following differences occur:
14119
141201. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
141212. The AMDGPU backend appends additional arguments to the kernel's explicit
14122   arguments for the AMDHSA OS (see
14123   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
141243. Additional metadata is generated
14125   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14126
  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================
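
The sizes and alignments in the table above can be turned into byte offsets
with a small Python sketch. It assumes that the implicit arguments are laid
out contiguously, in table order, starting at the first suitably aligned
offset after the explicit arguments; that layout is an assumption made for
illustration, and the argument names are hypothetical.

.. code-block:: python

  # (name, byte size, byte alignment) in table order; names are illustrative.
  IMPLICIT_ARGS = [
      ("global_offset_x", 8, 8),
      ("global_offset_y", 8, 8),
      ("global_offset_z", 8, 8),
      ("printf_buffer", 8, 8),
      ("virtual_queue", 8, 8),
      ("aqlwrap", 8, 8),
      ("multigrid_sync_arg", 8, 8),
  ]


  def align_up(offset, alignment):
      """Round offset up to the next multiple of alignment."""
      return (offset + alignment - 1) // alignment * alignment


  def implicit_arg_offsets(explicit_args_size):
      """Return the assumed byte offset of each appended implicit argument."""
      offsets = {}
      offset = explicit_args_size
      for name, size, alignment in IMPLICIT_ARGS:
          offset = align_up(offset, alignment)
          offsets[name] = offset
          offset += size
      return offsets

Under these assumptions, a kernel with 20 bytes of explicit arguments would
have its first implicit argument at byte offset 24.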
14145
14146.. _amdgpu-hcc:
14147
14148HCC
14149---
14150
When the language is HCC, the following differences occur:
14152
141531. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14154
14155.. _amdgpu-assembler:
14156
14157Assembler
14158---------
14159
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes the general syntax for instructions and operands.
14164
14165Instructions
14166~~~~~~~~~~~~
14167
14168An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14169
14170  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14171    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14172
14173:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14174:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14175
14176The order of operands and modifiers is fixed.
14177Most modifiers are optional and may be omitted.
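
The separation rules above can be sketched in Python. This is a toy splitter
written for illustration only, not the assembler's parser; the function name
``split_instruction`` is hypothetical. It assumes that modifiers contain no
spaces, that the line carries no comment, and that the first operand is a
single token, so expressions such as the ``s_waitcnt`` counter syntax are not
handled.

.. code-block:: python

  def split_instruction(line):
      """Split one instruction into (opcode, operands, modifiers)."""
      parts = line.strip().split(None, 1)
      opcode = parts[0]
      rest = parts[1] if len(parts) > 1 else ""

      # Split on commas that are not nested inside [...] so that a modifier
      # such as quad_perm:[0,2,1,1] stays in one piece.
      pieces, depth, current = [], 0, ""
      for ch in rest:
          if ch == "[":
              depth += 1
          elif ch == "]":
              depth -= 1
          if ch == "," and depth == 0:
              pieces.append(current.strip())
              current = ""
          else:
              current += ch
      if current.strip():
          pieces.append(current.strip())
      if not pieces:
          return opcode, [], []

      # The last comma-separated piece holds the final operand followed by
      # the space-separated modifiers.
      *leading, last = pieces
      last_tokens = last.split()
      return opcode, leading + last_tokens[:1], last_tokens[1:]

For ``ds_add_u32 v2, v4 offset:16`` the sketch separates the opcode
``ds_add_u32``, the operands ``v2`` and ``v4``, and the modifier
``offset:16``.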
14178
Links to detailed instruction syntax descriptions may be found in the following
table. Note that features under development are not included
in this description.
14182
14183    =================================== =======================================
14184    Core ISA                            ISA Extensions
14185    =================================== =======================================
14186    :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
14187    :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
14188    :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14189
14190                                        :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14191
14192                                        :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14193
14194                                        :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14195
14196                                        :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14197
14198                                        :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14199
14200                                        :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14201
14202    :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14203
14204                                        :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14205    =================================== =======================================
14206
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.
14212
14213Operands
14214~~~~~~~~
14215
A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.
14217
14218Modifiers
14219~~~~~~~~~
14220
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
14223
14224Instruction Examples
14225~~~~~~~~~~~~~~~~~~~~
14226
14227DS
14228++
14229
14230.. code-block:: nasm
14231
14232  ds_add_u32 v2, v4 offset:16
14233  ds_write_src2_b64 v2 offset0:4 offset1:8
14234  ds_cmpst_f32 v2, v4, v6
14235  ds_min_rtn_f64 v[8:9], v2, v[4:5]
14236
For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA manual.
14239
14240FLAT
14241++++
14242
14243.. code-block:: nasm
14244
14245  flat_load_dword v1, v[3:4]
14246  flat_store_dwordx3 v[3:4], v[5:7]
14247  flat_atomic_swap v1, v[3:4], v5 glc
14248  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14249  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14250
For a full list of supported instructions, refer to "FLAT instructions" in the
ISA manual.
14253
14254MUBUF
14255+++++
14256
14257.. code-block:: nasm
14258
14259  buffer_load_dword v1, off, s[4:7], s1
14260  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14261  buffer_store_format_xy v[1:2], off, s[4:7], s1
14262  buffer_wbinvl1
14263  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14264
For a full list of supported instructions, refer to "MUBUF Instructions" in the
ISA manual.
14267
14268SMRD/SMEM
14269+++++++++
14270
14271.. code-block:: nasm
14272
14273  s_load_dword s1, s[2:3], 0xfc
14274  s_load_dwordx8 s[8:15], s[2:3], s4
14275  s_load_dwordx16 s[88:103], s[2:3], s4
14276  s_dcache_inv_vol
14277  s_memtime s[4:5]
14278
For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA manual.
14281
14282SOP1
14283++++
14284
14285.. code-block:: nasm
14286
14287  s_mov_b32 s1, s2
14288  s_mov_b64 s[0:1], 0x80000000
14289  s_cmov_b32 s1, 200
14290  s_wqm_b64 s[2:3], s[4:5]
14291  s_bcnt0_i32_b64 s1, s[2:3]
14292  s_swappc_b64 s[2:3], s[4:5]
14293  s_cbranch_join s[4:5]
14294
For a full list of supported instructions, refer to "SOP1 Instructions" in the
ISA manual.
14297
14298SOP2
14299++++
14300
14301.. code-block:: nasm
14302
14303  s_add_u32 s1, s2, s3
14304  s_and_b64 s[2:3], s[4:5], s[6:7]
14305  s_cselect_b32 s1, s2, s3
14306  s_andn2_b32 s2, s4, s6
14307  s_lshr_b64 s[2:3], s[4:5], s6
14308  s_ashr_i32 s2, s4, s6
14309  s_bfm_b64 s[2:3], s4, s6
14310  s_bfe_i64 s[2:3], s[4:5], s6
14311  s_cbranch_g_fork s[4:5], s[6:7]
14312
For a full list of supported instructions, refer to "SOP2 Instructions" in the
ISA manual.
14315
14316SOPC
14317++++
14318
14319.. code-block:: nasm
14320
14321  s_cmp_eq_i32 s1, s2
14322  s_bitcmp1_b32 s1, s2
14323  s_bitcmp0_b64 s[2:3], s4
14324  s_setvskip s3, s5
14325
For a full list of supported instructions, refer to "SOPC Instructions" in the
ISA manual.
14328
14329SOPP
14330++++
14331
14332.. code-block:: nasm
14333
14334  s_barrier
14335  s_nop 2
14336  s_endpgm
14337  s_waitcnt 0 ; Wait for all counters to be 0
14338  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
14339  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
14340  s_sethalt 9
14341  s_sleep 10
14342  s_sendmsg 0x1
14343  s_sendmsg sendmsg(MSG_INTERRUPT)
14344  s_trap 1
14345
For a full list of supported instructions, refer to "SOPP Instructions" in the
ISA manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
14352
14353VALU
14354++++
14355
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the
instruction's operands. To force a specific encoding, add a suffix to the
opcode of the instruction:
14359
14360* _e32 for 32-bit VOP1/VOP2/VOPC
14361* _e64 for 64-bit VOP3
14362* _dpp for VOP_DPP
14363* _sdwa for VOP_SDWA
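
The suffix convention can be summarized with a small Python lookup. The table
contents come from the list above; the function name ``forced_encoding`` is
hypothetical, and the sketch only inspects the opcode text rather than
modelling how the assembler selects an encoding.

.. code-block:: python

  # Suffixes and the encodings they force, as listed above.
  ENCODING_SUFFIXES = {
      "_e32": "32-bit VOP1/VOP2/VOPC",
      "_e64": "64-bit VOP3",
      "_dpp": "VOP_DPP",
      "_sdwa": "VOP_SDWA",
  }


  def forced_encoding(opcode):
      """Return the encoding forced by the opcode suffix, or None when the
      choice of encoding is left to the assembler."""
      for suffix, encoding in ENCODING_SUFFIXES.items():
          if opcode.endswith(suffix):
              return encoding
      return None

For instance, ``v_mul_i32_i24_e64`` requests the 64-bit VOP3 encoding, while
plain ``v_mov_b32`` carries no suffix and leaves the choice to the assembler.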
14364
14365VOP1/VOP2/VOP3/VOPC examples:
14366
14367.. code-block:: nasm
14368
14369  v_mov_b32 v1, v2
14370  v_mov_b32_e32 v1, v2
14371  v_nop
14372  v_cvt_f64_i32_e32 v[1:2], v2
14373  v_floor_f32_e32 v1, v2
14374  v_bfrev_b32_e32 v1, v2
14375  v_add_f32_e32 v1, v2, v3
14376  v_mul_i32_i24_e64 v1, v2, 3
14377  v_mul_i32_i24_e32 v1, -3, v3
14378  v_mul_i32_i24_e32 v1, -100, v3
14379  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
14380  v_max_f16_e32 v1, v2, v3
14381
14382VOP_DPP examples:
14383
14384.. code-block:: nasm
14385
14386  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
14387  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14388  v_mov_b32 v0, v0 wave_shl:1
14389  v_mov_b32 v0, v0 row_mirror
14390  v_mov_b32 v0, v0 row_bcast:31
14391  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
14392  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14393  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14394
14395VOP_SDWA examples:
14396
14397.. code-block:: nasm
14398
14399  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14400  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14401  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14402  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14403  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14404
For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA manual.
14406
14407.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14408
14409Code Object V2 Predefined Symbols
14410~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14411
14412.. warning::
14413  Code object V2 is not the default code object version emitted by
14414  this version of LLVM.
14415
14416The AMDGPU assembler defines and updates some symbols automatically. These
14417symbols do not affect code generation.
14418
14419.option.machine_version_major
14420+++++++++++++++++++++++++++++
14421
14422Set to the GFX major generation number of the target being assembled for. For
14423example, when assembling for a "GFX9" target this will be set to the integer
14424value "9". The possible GFX major generation numbers are presented in
14425:ref:`amdgpu-processors`.
14426
14427.option.machine_version_minor
14428+++++++++++++++++++++++++++++
14429
14430Set to the GFX minor generation number of the target being assembled for. For
14431example, when assembling for a "GFX810" target this will be set to the integer
14432value "1". The possible GFX minor generation numbers are presented in
14433:ref:`amdgpu-processors`.
14434
14435.option.machine_version_stepping
14436++++++++++++++++++++++++++++++++
14437
14438Set to the GFX stepping generation number of the target being assembled for.
14439For example, when assembling for a "GFX704" target this will be set to the
14440integer value "4". The possible GFX stepping generation numbers are presented
14441in :ref:`amdgpu-processors`.
14442
14443.kernel.vgpr_count
14444++++++++++++++++++
14445
14446Set to zero each time a
14447:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14448encountered. At each instruction, if the current value of this symbol is less
14449than or equal to the maximum VGPR number explicitly referenced within that
14450instruction then the symbol value is updated to equal that VGPR number plus
14451one.
14452
14453.kernel.sgpr_count
14454++++++++++++++++++
14455
Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.
14462
14463.. _amdgpu-amdhsa-assembler-directives-v2:
14464
14465Code Object V2 Directives
14466~~~~~~~~~~~~~~~~~~~~~~~~~
14467
14468.. warning::
14469  Code object V2 is not the default code object version emitted by
14470  this version of LLVM.
14471
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
14474
14475.hsa_code_object_version major, minor
14476+++++++++++++++++++++++++++++++++++++
14477
14478*major* and *minor* are integers that specify the version of the HSA code
14479object that will be generated by the assembler.
14480
14481.hsa_code_object_isa [major, minor, stepping, vendor, arch]
14482+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14483
14484
14485*major*, *minor*, and *stepping* are all integers that describe the instruction
14486set architecture (ISA) version of the assembly program.
14487
14488*vendor* and *arch* are quoted strings. *vendor* should always be equal to
14489"AMD" and *arch* should always be equal to "AMDGPU".
14490
14491By default, the assembler will derive the ISA version, *vendor*, and *arch*
14492from the value of the -mcpu option that is passed to the assembler.
14493
14494.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
14495
14496.amdgpu_hsa_kernel (name)
14497+++++++++++++++++++++++++
14498
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
14502
14503.amd_kernel_code_t
14504++++++++++++++++++
14505
14506This directive marks the beginning of a list of key / value pairs that are used
14507to specify the amd_kernel_code_t object that will be emitted by the assembler.
14508The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
14509amd_kernel_code_t values that are unspecified a default value will be used. The
14510default value for all keys is 0, with the following exceptions:
14511
14512- *amd_code_version_major* defaults to 1.
14513- *amd_kernel_code_version_minor* defaults to 2.
14514- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
  it defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
  Note that wavefront size is specified as a power of two, so a value of **n**
  means a size of 2^ **n**.
14523- *call_convention* defaults to -1.
14524- *kernarg_segment_alignment*, *group_segment_alignment*, and
14525  *private_segment_alignment* default to 4. Note that alignments are specified
14526  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
14527- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
14528  GFX90A onwards.
14529- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
14530  GFX10 onwards.
14531- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
14532
14533The *.amd_kernel_code_t* directive must be placed immediately after the
14534function label and before any instructions.
14535
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h and
test/CodeGen/AMDGPU/hsa.s.
14538
14539.. _amdgpu-amdhsa-assembler-example-v2:
14540
14541Code Object V2 Example Source Code
14542~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14543
14544.. warning::
14545  Code Object V2 is not the default code object version emitted by
14546  this version of LLVM.
14547
14548Here is an example of a minimal assembly source file, defining one HSA kernel:
14549
.. code::
   :number-lines:

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         compute_pgm_rsrc1_wgp_mode = 0
         compute_pgm_rsrc1_mem_ordered = 0
         compute_pgm_rsrc1_fwd_progress = 1
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0 ; Load a 64-bit address from the kernarg segment
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)               ; Wait for the scalar load to complete
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0        ; Store the constant to the loaded address
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world
14583
14584.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
14585
14586Code Object V3 and Above Predefined Symbols
14587~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14588
14589The AMDGPU assembler defines and updates some symbols automatically. These
14590symbols do not affect code generation.
14591
14592.amdgcn.gfx_generation_number
14593+++++++++++++++++++++++++++++
14594
14595Set to the GFX major generation number of the target being assembled for. For
14596example, when assembling for a "GFX9" target this will be set to the integer
14597value "9". The possible GFX major generation numbers are presented in
14598:ref:`amdgpu-processors`.
14599
14600.amdgcn.gfx_generation_minor
14601++++++++++++++++++++++++++++
14602
14603Set to the GFX minor generation number of the target being assembled for. For
14604example, when assembling for a "GFX810" target this will be set to the integer
14605value "1". The possible GFX minor generation numbers are presented in
14606:ref:`amdgpu-processors`.
14607
14608.amdgcn.gfx_generation_stepping
14609+++++++++++++++++++++++++++++++
14610
14611Set to the GFX stepping generation number of the target being assembled for.
14612For example, when assembling for a "GFX704" target this will be set to the
14613integer value "4". The possible GFX stepping generation numbers are presented
14614in :ref:`amdgpu-processors`.
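
The three version symbols can be related to a processor name with a short
Python sketch. It assumes that the last two characters of the name are the
minor and stepping numbers, and that the stepping is a single hexadecimal
digit (as the name ``gfx90a`` suggests); this naming rule is an assumption
made for illustration, and the function name ``split_gfx_version`` is
hypothetical. It expects a full processor name such as ``gfx810``.

.. code-block:: python

  def split_gfx_version(processor):
      """Split a name such as "gfx810" into (major, minor, stepping)."""
      name = processor.lower()
      if not name.startswith("gfx") or len(name) < 6:
          raise ValueError(f"unexpected processor name: {processor!r}")
      digits = name[3:]
      major = int(digits[:-2])        # e.g. 8 for gfx810, 10 for gfx1011
      minor = int(digits[-2])         # e.g. 1 for gfx810
      stepping = int(digits[-1], 16)  # e.g. 4 for gfx704, 10 for gfx90a
      return major, minor, stepping

This matches the examples given above: ``gfx810`` has a minor number of 1 and
``gfx704`` has a stepping number of 4.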
14615
14616.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
14617
14618.amdgcn.next_free_vgpr
14619++++++++++++++++++++++
14620
14621Set to zero before assembly begins. At each instruction, if the current value
14622of this symbol is less than or equal to the maximum VGPR number explicitly
14623referenced within that instruction then the symbol value is updated to equal
14624that VGPR number plus one.
14625
May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.
14628
14629May be set at any time, e.g. manually set to zero at the start of each kernel.
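
The update rule for this symbol can be modelled with a short Python sketch. It
is only an approximation for illustration: the regular expression recognizes
plain ``vN`` and ``v[A:B]`` register references in the instruction text, and
the function names are hypothetical rather than part of the assembler.

.. code-block:: python

  import re

  # Matches a single register (v7) or a register range (v[4:5]).
  VGPR_PATTERN = re.compile(r"\bv(?:(\d+)\b|\[(\d+):(\d+)\])")


  def max_vgpr_referenced(instruction):
      """Return the highest VGPR number named in the text, or None."""
      highest = None
      for single, _low, high in VGPR_PATTERN.findall(instruction):
          number = int(single) if single else int(high)
          if highest is None or number > highest:
              highest = number
      return highest


  def track_next_free_vgpr(instructions, start=0):
      """Apply the update rule to each instruction in turn.

      ``start`` models the symbol's value before the first instruction,
      which is zero unless the source sets it manually.
      """
      next_free = start
      for instruction in instructions:
          highest = max_vgpr_referenced(instruction)
          if highest is not None and next_free <= highest:
              next_free = highest + 1
      return next_free

For a sequence whose highest referenced register is ``v2``, such as
``v_mov_b32 v1, s0`` followed by ``flat_store_dword v[1:2], v0``, the sketch
returns 3.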
14630
14631.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
14632
14633.amdgcn.next_free_sgpr
14634++++++++++++++++++++++
14635
Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.
14643
14644May be set at any time, e.g. manually set to zero at the start of each kernel.
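
The update rule shared by ``.amdgcn.next_free_vgpr`` and
``.amdgcn.next_free_sgpr`` can be modelled with a short Python sketch. This is
an illustration of the rule as described here, not the assembler's
implementation: ``track_next_free`` and its regular expression are
hypothetical, and only plain ``v``/``s`` registers and register ranges are
recognized.

.. code:: python

   import re

   # Hypothetical pattern: matches v7, s3, v[1:2] and s[0:1] only.
   _REG = re.compile(r"\b([vs])(?:\[(\d+):(\d+)\]|(\d+)(?!\w))")


   def track_next_free(instructions, next_free_vgpr=0, next_free_sgpr=0):
       """Return (next_free_vgpr, next_free_sgpr) after the instructions."""
       next_free = {"v": next_free_vgpr, "s": next_free_sgpr}
       for text in instructions:
           # Highest register number of each kind used by this instruction.
           highest = {}
           for kind, _low, high, single in _REG.findall(text):
               number = int(high) if high else int(single)
               highest[kind] = max(highest.get(kind, -1), number)
           for kind, number in highest.items():
               # The symbol only moves when it is <= the highest number seen.
               if next_free[kind] <= number:
                   next_free[kind] = number + 1
       return next_free["v"], next_free["s"]


   body = [
       "s_load_dwordx2 s[0:1], s[0:1], 0x0",
       "v_mov_b32 v0, 3.14159",
       "v_mov_b32 v1, s0",
       "v_mov_b32 v2, s1",
       "flat_store_dword v[1:2], v0",
   ]
   print(track_next_free(body))  # (3, 2)

Passing non-zero starting values models a symbol that was set manually; the
symbol then stays unchanged until an instruction references a register number
at or above it.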
14645
14646.. _amdgpu-amdhsa-assembler-directives-v3-onwards:
14647
14648Code Object V3 and Above Directives
14649~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14650
14651Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
14652architecture processors, and are not OS-specific. Directives which begin with
14653``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
14654``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
14655:ref:`amdgpu-processors`.
14656
14657.. _amdgpu-assembler-directive-amdgcn-target:
14658
14659.amdgcn_target <target-triple> "-" <target-id>
14660++++++++++++++++++++++++++++++++++++++++++++++
14661
14662Optional directive which declares the ``<target-triple>-<target-id>`` supported
14663by the containing assembler source file. Used by the assembler to validate
14664command-line options such as ``-triple``, ``-mcpu``, and
14665``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
14666:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
14667
14668.. note::
14669
14670  The target ID syntax used for code object V2 to V3 for this directive differs
14671  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
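
Because the target triple itself contains ``-`` separators, it can help to see
how the directive's string breaks down. The Python sketch below is only an
illustration of the ``<target-triple> "-" <target-id>`` layout and is not how
the assembler validates the directive. ``split_amdgcn_target`` is a
hypothetical helper, and it assumes the triple always has exactly four
components, with the environment possibly empty.

.. code:: python

   def split_amdgcn_target(text):
       """Split '<arch>-<vendor>-<os>-<environment>-<target-id>' into parts."""
       # At most four splits, so any '-' inside the target ID is left alone.
       parts = text.split("-", 4)
       if len(parts) != 5:
           raise ValueError(f"expected four '-' separators in {text!r}")
       arch, vendor, os_name, environment, target_id = parts
       return {
           "architecture": arch,
           "vendor": vendor,
           "os": os_name,
           "environment": environment,
           "target_id": target_id,
       }


   info = split_amdgcn_target("amdgcn-amd-amdhsa--gfx900+xnack")
   print(info["os"], repr(info["environment"]), info["target_id"])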
14672
14673.amdhsa_kernel <name>
14674+++++++++++++++++++++
14675
14676Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
14677``<name>.kd``, in the current location of the current section. Only valid when
14678the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
14679instruction to execute, and does not need to be previously defined.
14680
14681Marks the beginning of a list of directives used to generate the bytes of a
14682kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
14683Directives which may appear in this list are described in
14684:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
14685be valid for the target being assembled for, and cannot be repeated. Directives
14686support the range of values specified by the field they reference in
14687:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
14688assumed to have its default value, unless it is marked as "Required", in which
14689case it is an error to omit the directive. This list of directives is
14690terminated by an ``.end_amdhsa_kernel`` directive.
14691
14692  .. table:: AMDHSA Kernel Assembler Directives
14693     :name: amdhsa-kernel-directives-table
14694
14695     ======================================================== =================== ============ ===================
14696     Directive                                                Default             Supported On Description
14697     ======================================================== =================== ============ ===================
14698     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
14699                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14700     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
14701                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14702     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX10   Controls KERNARG_SIZE in
14703                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14704     ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX10   Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14706     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
14707                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14708                                                                                  GFX940)
14709     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
14710                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14711     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
14712                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14713     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
14714                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14715     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
14716                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14717     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
14718                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14719                                                                                  GFX940)
14720     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
14721                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14722     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
14723                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14724                                                              Specific
14725                                                              (wavefrontsize64)
14726     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
14727                                                                                  (except      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14728                                                                                  GFX940)
14729     ``.amdhsa_enable_private_segment``                       0                   GFX940       Controls ENABLE_PRIVATE_SEGMENT in
14730                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14731     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
14732                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14733     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
14734                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14735     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
14736                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14737     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
14738                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14739     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
14740                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14741                                                                                               Possible values are defined in
14742                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
14743     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
14744                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
14745                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14746     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
14747                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14748                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of the first AccVGPR in the unified register file.
14750                                                                                  GFX940       Used to calculate ACCUM_OFFSET in
14751                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14752     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
14753                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14754                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14755     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
14756                                                                                  (except      scratch memory. Used to calculate
14757                                                                                  GFX940)      GRANULATED_WAVEFRONT_SGPR_COUNT in
14758                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14759     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
14760                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14761                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14762                                                              (xnack)
14763     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
14764                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14765                                                                                               Possible values are defined in
14766                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14767     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
14768                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14769                                                                                               Possible values are defined in
14770                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14771     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
14772                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14773                                                                                               Possible values are defined in
14774                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14775     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
14776                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14777                                                                                               Possible values are defined in
14778                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14779     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
14780                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14781     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
14782                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14783     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
14784                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14785     ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in
14786                                                              Feature             GFX940       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14787                                                              Specific
14788                                                              (tgsplit)
14789     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
14790                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14791                                                              Specific
14792                                                              (cumode)
14793     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
14794                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14795     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
14796                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
14797     ``.amdhsa_shared_vgpr_count``                            0                   GFX10        Controls SHARED_VGPR_COUNT in
14798                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
14799     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
14800                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14801     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
14802                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14803     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
14804                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14805     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
14806                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14807     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
14808                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14809     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
14810                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14811     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
14812                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
14813     ======================================================== =================== ============ ===================
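
The list rules stated for ``.amdhsa_kernel`` (no repeated directives, required
directives present, and an ``.end_amdhsa_kernel`` terminator) can be sketched
as a small Python check. This is a simplified illustration, not the
assembler's logic. ``check_kernel_block`` is hypothetical, it does not
validate values or target support, and its required set holds only the two
directives marked "Required" for GFX6-GFX10 in the table, so
``.amdhsa_accum_offset`` is deliberately left out.

.. code:: python

   # Assumption: only the directives marked "Required" for GFX6-GFX10.
   REQUIRED = {".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"}


   def check_kernel_block(lines):
       """Return a list of problems found in one .amdhsa_kernel block."""
       problems = []
       # Drop blank lines and // comments before looking at directives.
       body = [ln.strip() for ln in lines]
       body = [ln for ln in body if ln and not ln.startswith("//")]
       if not body or not body[0].startswith(".amdhsa_kernel "):
           problems.append("block must start with .amdhsa_kernel <name>")
       if not body or body[-1] != ".end_amdhsa_kernel":
           problems.append("block must end with .end_amdhsa_kernel")
       seen = set()
       for ln in body[1:-1]:
           name = ln.split()[0]
           if name in seen:
               problems.append(f"{name} is repeated")
           seen.add(name)
       for name in sorted(REQUIRED - seen):
           problems.append(f"{name} is required but missing")
       return problems


   block = [
       ".amdhsa_kernel hello_world",
       "  .amdhsa_user_sgpr_kernarg_segment_ptr 1",
       "  .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr",
       "  .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr",
       ".end_amdhsa_kernel",
   ]
   print(check_kernel_block(block))  # []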
14814
14815.amdgpu_metadata
14816++++++++++++++++
14817
14818Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
14819note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
14820
14821The contents must be in the [YAML]_ markup format, with the same structure and
14822semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
14823:ref:`amdgpu-amdhsa-code-object-metadata-v4` or
14824:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
14825
14826This directive is terminated by an ``.end_amdgpu_metadata`` directive.
14827
14828.. _amdgpu-amdhsa-assembler-example-v3-onwards:
14829
14830Code Object V3 and Above Example Source Code
14831~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14832
14833Here is an example of a minimal assembly source file, defining one HSA kernel:
14834
14835.. code::
14836   :number-lines:
14837
14838   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
14839
14840   .text
14841   .globl hello_world
14842   .p2align 8
14843   .type hello_world,@function
14844   hello_world:
     s_load_dwordx2 s[0:1], s[0:1], 0x0
14846     v_mov_b32 v0, 3.14159
14847     s_waitcnt lgkmcnt(0)
14848     v_mov_b32 v1, s0
14849     v_mov_b32 v2, s1
14850     flat_store_dword v[1:2], v0
14851     s_endpgm
14852   .Lfunc_end0:
14853     .size   hello_world, .Lfunc_end0-hello_world
14854
14855   .rodata
14856   .p2align 6
14857   .amdhsa_kernel hello_world
14858     .amdhsa_user_sgpr_kernarg_segment_ptr 1
14859     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
14860     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
14861   .end_amdhsa_kernel
14862
14863   .amdgpu_metadata
14864   ---
14865   amdhsa.version:
14866     - 1
14867     - 0
14868   amdhsa.kernels:
14869     - .name: hello_world
14870       .symbol: hello_world.kd
14871       .kernarg_segment_size: 48
14872       .group_segment_fixed_size: 0
14873       .private_segment_fixed_size: 0
14874       .kernarg_segment_align: 4
14875       .wavefront_size: 64
14876       .sgpr_count: 2
14877       .vgpr_count: 3
14878       .max_flat_workgroup_size: 256
14879       .args:
14880         - .size: 8
14881           .offset: 0
14882           .value_kind: global_buffer
14883           .address_space: global
14884           .actual_access: write_only
14885   //...
14886   .end_amdgpu_metadata
14887
14888This kernel is equivalent to the following HIP program:
14889
14890.. code::
14891   :number-lines:
14892
14893   __global__ void hello_world(float *p) {
14894       *p = 3.14159f;
14895   }
14896
14897If an assembly source file contains multiple kernels and/or functions, the
14898:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
14899:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient to
group the function with the kernel that calls it and reset the symbols between
the two connected components:
14904
14905.. code::
14906   :number-lines:
14907
14908   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
14909
14910   // gpr tracking symbols are implicitly set to zero
14911
14912   .text
14913   .globl kern0
14914   .p2align 8
14915   .type kern0,@function
14916   kern0:
14917     // ...
14918     s_endpgm
14919   .Lkern0_end:
14920     .size   kern0, .Lkern0_end-kern0
14921
14922   .rodata
14923   .p2align 6
14924   .amdhsa_kernel kern0
14925     // ...
14926     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
14927     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
14928   .end_amdhsa_kernel
14929
14930   // reset symbols to begin tracking usage in func1 and kern1
14931   .set .amdgcn.next_free_vgpr, 0
14932   .set .amdgcn.next_free_sgpr, 0
14933
14934   .text
14935   .hidden func1
14936   .global func1
14937   .p2align 2
14938   .type func1,@function
14939   func1:
14940     // ...
14941     s_setpc_b64 s[30:31]
14942   .Lfunc1_end:
     .size   func1, .Lfunc1_end-func1
14944
14945   .globl kern1
14946   .p2align 8
14947   .type kern1,@function
14948   kern1:
14949     // ...
14950     s_getpc_b64 s[4:5]
14951     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+12
14953     s_swappc_b64 s[30:31], s[4:5]
14954     // ...
14955     s_endpgm
14956   .Lkern1_end:
14957     .size   kern1, .Lkern1_end-kern1
14958
14959   .rodata
14960   .p2align 6
14961   .amdhsa_kernel kern1
14962     // ...
14963     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
14964     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
14965   .end_amdhsa_kernel
14966
The assembler does not identify connected components of the call graph, so
these symbols cannot track usage per kernel automatically. However, in some
cases careful organization of the kernels and functions in the source file
means little additional effort is required to calculate GPR usage accurately.
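
When planning such a layout by hand, the grouping can be worked out from the
call graph. The Python sketch below is a hypothetical aid, not something the
assembler provides: ``connected_components`` takes a list of kernel names and
an assumed caller-to-callees mapping and returns the groups between which the
tracking symbols would be reset.

.. code:: python

   def connected_components(kernels, calls):
       """Group kernels and functions that are linked by calls."""
       parent = {}

       def find(name):
           # Union-find lookup with path halving.
           parent.setdefault(name, name)
           while parent[name] != name:
               parent[name] = parent[parent[name]]
               name = parent[name]
           return name

       for kernel in kernels:
           find(kernel)
       for caller, callees in calls.items():
           for callee in callees:
               root_a, root_b = find(caller), find(callee)
               if root_a != root_b:
                   parent[root_b] = root_a

       groups = {}
       for name in list(parent):
           groups.setdefault(find(name), []).append(name)
       return list(groups.values())


   for group in connected_components(["kern0", "kern1"], {"kern1": ["func1"]}):
       print(group)

For the example above this prints one group containing only ``kern0`` and one
containing ``kern1`` and ``func1``, which is the split at which the two
``.set`` directives are placed.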
14971
14972Additional Documentation
14973========================
14974
14975.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
14977.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
14978.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
14979.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
14980.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
14981.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
14982.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
14983.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
14984.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
14985.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
14986.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
14987.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
14988.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
14989.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
14990.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
14991.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
14992.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
14993.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
14994.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
14995.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
14996.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
14997.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
14998.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
14999