=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX940
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPU/AMDGPUAsmGFX1013
   AMDGPU/AMDGPUAsmGFX1030
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

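
The way the four tables above combine into a triple can be sketched in a few
lines of Python. This is only an illustration: ``make_triple`` and the three
sets are hypothetical names that mirror the tables, not part of Clang or LLVM,
and dropping trailing empty components (so that an empty environment gives
``amdgcn-amd-amdhsa``) is an assumption of the sketch rather than a documented
rule.

.. code-block:: python

   # Values copied from the architecture, vendor, and OS tables above.
   ARCHITECTURES = {"r600", "amdgcn"}
   VENDORS = {"amd", "mesa3d"}
   OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}  # "" means unknown


   def make_triple(arch, vendor="amd", os_name="", environment=""):
       """Return '<arch>-<vendor>-<os>-<environment>' for validated parts.

       Assumption of this sketch: trailing empty components are dropped.
       """
       if arch not in ARCHITECTURES:
           raise ValueError(f"unknown architecture: {arch!r}")
       if vendor not in VENDORS:
           raise ValueError(f"unknown vendor: {vendor!r}")
       if os_name not in OPERATING_SYSTEMS:
           raise ValueError(f"unknown OS: {os_name!r}")
       if vendor == "mesa3d" and os_name != "mesa3d":
           raise ValueError("vendor 'mesa3d' is only listed for OS 'mesa3d'")
       if environment:
           raise ValueError("only the empty (default) environment is listed")
       return "-".join([arch, vendor, os_name, environment]).rstrip("-")


   triple = make_triple("amdgcn", "amd", "amdhsa")
   # Argument list only; nothing is executed here.
   clang_args = ["clang", "-target", triple]
   print(" ".join(clang_args))  # prints: clang -target amdgcn-amd-amdhsa

Validating each component against the tables before joining keeps unsupported
combinations, such as the ``mesa3d`` vendor with a non-``mesa3d`` OS, from
reaching the compiler command line.
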
.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target-specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:

* ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).


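
As a rough illustration of the rule above, the following Python sketch checks
the single listed exception and formats the processor option. The helper names
(``os_abi_supported``, ``processor_flag``, ``compile_args``) are hypothetical,
the vendor is assumed to be ``amd``, and the check covers only the exception
stated here; the OS Support column of the processor table and the runtime
release notes remain the reference for what is actually supported.

.. code-block:: python

   def os_abi_supported(architecture, os_name):
       """Apply the one exception listed above: no amdhsa on r600."""
       return not (architecture == "r600" and os_name == "amdhsa")


   def processor_flag(target_id, offload=False):
       """Format the processor selection as -mcpu= or --offload-arch=."""
       option = "--offload-arch" if offload else "-mcpu"
       return f"{option}={target_id}"


   def compile_args(architecture, os_name, target_id):
       """Build an illustrative clang argument list for one processor."""
       if not os_abi_supported(architecture, os_name):
           raise ValueError(f"{os_name} is not supported on {architecture}")
       triple = f"{architecture}-amd-{os_name}"  # vendor 'amd' assumed
       return ["clang", "-target", triple, processor_flag(target_id)]


   # Argument list only; nothing is executed here.
   print(compile_args("amdgcn", "amdhsa", "gfx906"))
   # prints: ['clang', '-target', 'amdgcn-amd-amdhsa', '-mcpu=gfx906']

Calling ``compile_args("r600", "amdhsa", "cayman")`` raises ``ValueError``,
which reflects the exception listed above.
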
117  .. table:: AMDGPU Processors
118     :name: amdgpu-processor-table
119
120     =========== =============== ============ ===== ================= =============== =============== ======================
121     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
122                 Processor       Triple       APU   Features          Properties      *(see*          Products
123                                 Architecture       Supported                         `amdgpu-os`_
124                                                                                      *and
125                                                                                      corresponding
126                                                                                      runtime release
127                                                                                      notes for
128                                                                                      current
129                                                                                      information and
130                                                                                      level of
131                                                                                      support)*
132     =========== =============== ============ ===== ================= =============== =============== ======================
133     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
134     -----------------------------------------------------------------------------------------------------------------------
135     ``r600``                    ``r600``     dGPU                    - Does not
136                                                                        support
137                                                                        generic
138                                                                        address
139                                                                        space
140     ``r630``                    ``r600``     dGPU                    - Does not
141                                                                        support
142                                                                        generic
143                                                                        address
144                                                                        space
145     ``rs880``                   ``r600``     dGPU                    - Does not
146                                                                        support
147                                                                        generic
148                                                                        address
149                                                                        space
150     ``rv670``                   ``r600``     dGPU                    - Does not
151                                                                        support
152                                                                        generic
153                                                                        address
154                                                                        space
155     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
156     -----------------------------------------------------------------------------------------------------------------------
157     ``rv710``                   ``r600``     dGPU                    - Does not
158                                                                        support
159                                                                        generic
160                                                                        address
161                                                                        space
162     ``rv730``                   ``r600``     dGPU                    - Does not
163                                                                        support
164                                                                        generic
165                                                                        address
166                                                                        space
167     ``rv770``                   ``r600``     dGPU                    - Does not
168                                                                        support
169                                                                        generic
170                                                                        address
171                                                                        space
172     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
173     -----------------------------------------------------------------------------------------------------------------------
174     ``cedar``                   ``r600``     dGPU                    - Does not
175                                                                        support
176                                                                        generic
177                                                                        address
178                                                                        space
179     ``cypress``                 ``r600``     dGPU                    - Does not
180                                                                        support
181                                                                        generic
182                                                                        address
183                                                                        space
184     ``juniper``                 ``r600``     dGPU                    - Does not
185                                                                        support
186                                                                        generic
187                                                                        address
188                                                                        space
189     ``redwood``                 ``r600``     dGPU                    - Does not
190                                                                        support
191                                                                        generic
192                                                                        address
193                                                                        space
194     ``sumo``                    ``r600``     dGPU                    - Does not
195                                                                        support
196                                                                        generic
197                                                                        address
198                                                                        space
199     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
200     -----------------------------------------------------------------------------------------------------------------------
201     ``barts``                   ``r600``     dGPU                    - Does not
202                                                                        support
203                                                                        generic
204                                                                        address
205                                                                        space
206     ``caicos``                  ``r600``     dGPU                    - Does not
207                                                                        support
208                                                                        generic
209                                                                        address
210                                                                        space
211     ``cayman``                  ``r600``     dGPU                    - Does not
212                                                                        support
213                                                                        generic
214                                                                        address
215                                                                        space
216     ``turks``                   ``r600``     dGPU                    - Does not
217                                                                        support
218                                                                        generic
219                                                                        address
220                                                                        space
221     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
222     -----------------------------------------------------------------------------------------------------------------------
223     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
224                                                                        support
225                                                                        generic
226                                                                        address
227                                                                        space
228     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
229                 - ``verde``                                            support
230                                                                        generic
231                                                                        address
232                                                                        space
233     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
234                 - ``oland``                                            support
235                                                                        generic
236                                                                        address
237                                                                        space
238     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
239     -----------------------------------------------------------------------------------------------------------------------
240     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
241                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
242                                                                        scratch       - *pal-amdpal*  - A8-7100
243                                                                                                      - A8 Pro-7150B
244                                                                                                      - A10-7300
245                                                                                                      - A10 Pro-7350B
246                                                                                                      - FX-7500
247                                                                                                      - A8-7200P
248                                                                                                      - A10-7400P
249                                                                                                      - FX-7600P
250     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
251                                                                        flat          - *pal-amdhsa*  - FirePro W9100
252                                                                        scratch       - *pal-amdpal*  - FirePro S9150
253                                                                                                      - FirePro S9170
254     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
255                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
256                                                                        scratch       - *pal-amdpal*  - Radeon R390
257                                                                                                      - Radeon R390x
258     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
259                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
260                                                                        scratch                       - E1-2500
261                                                                                                      - E2-3000
262                                                                                                      - E2-3800
263                                                                                                      - A4-5000
264                                                                                                      - A4-5100
265                                                                                                      - A6-5200
266                                                                                                      - A4 Pro-3340B
267     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
268                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
269                                                                        scratch                       - R7 260
270                                                                                                      - R7 260X
271     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
272                                                                        flat          - *pal-amdpal*
273                                                                        scratch                       .. TODO::
274
275                                                                                                        Add product
276                                                                                                        names.
277
278     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
279     -----------------------------------------------------------------------------------------------------------------------
280     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
281                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
282                                                                        scratch       - *pal-amdpal*  - A8-8600P
283                                                                                                      - Pro A8-8600B
284                                                                                                      - FX-8800P
285                                                                                                      - Pro A12-8800B
286                                                                                                      - A10-8700P
287                                                                                                      - Pro A10-8700B
288                                                                                                      - A10-8780P
289                                                                                                      - A10-9600P
290                                                                                                      - A10-9630P
291                                                                                                      - A12-9700P
292                                                                                                      - A12-9730P
293                                                                                                      - FX-9800P
294                                                                                                      - FX-9830P
295                                                                                                      - E2-9010
296                                                                                                      - A6-9210
297                                                                                                      - A9-9410
298     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
299                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
300                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
301     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
302                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
303                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
304                                                                                                      - Radeon Pro Duo
305                                                                                                      - FirePro S9300x2
306                                                                                                      - Radeon Instinct MI8
307     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
308                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
309                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
310     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
311                                                                        flat          - *pal-amdhsa*
312                                                                        scratch       - *pal-amdpal*
313     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
314                                                                        flat          - *pal-amdhsa*  - FirePro S7100
315                                                                        scratch       - *pal-amdpal*  - FirePro W7100
316                                                                                                      - Mobile FirePro
317                                                                                                        M7170
318     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
319                                                                        flat          - *pal-amdhsa*
320                                                                        scratch       - *pal-amdpal*  .. TODO::
321
322                                                                                                        Add product
323                                                                                                        names.
324
325     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
326     -----------------------------------------------------------------------------------------------------------------------
327     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
328                                                                        flat          - *pal-amdhsa*    Frontier Edition
329                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
330                                                                                                      - Radeon RX Vega 64
331                                                                                                      - Radeon RX Vega 64
332                                                                                                        Liquid
333                                                                                                      - Radeon Instinct MI25
334     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
335                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
336                                                                        scratch       - *pal-amdpal*
337     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
338                                                                                      - *pal-amdhsa*
339                                                                                      - *pal-amdpal*  .. TODO::
340
341                                                                                                        Add product
342                                                                                                        names.
343
344     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
345                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
346                                                                        scratch       - *pal-amdpal*  - Radeon VII
347                                                                                                      - Radeon Pro VII
348     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
349                                                    - xnack           - Absolute
350                                                                        flat
351                                                                        scratch
352     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
353                                                                        flat
354                                                                        scratch                       .. TODO::
355
356                                                                                                        Add product
357                                                                                                        names.
358
359     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
360                                                    - tgsplit           flat
361                                                    - xnack             scratch                       .. TODO::
362                                                                      - Packed
363                                                                        work-item                       Add product
364                                                                        IDs                             names.
365
366     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
367                                                                        flat                          - Ryzen 7 4700GE
368                                                                        scratch                       - Ryzen 5 4600G
369                                                                                                      - Ryzen 5 4600GE
370                                                                                                      - Ryzen 3 4300G
371                                                                                                      - Ryzen 3 4300GE
372                                                                                                      - Ryzen Pro 4000G
373                                                                                                      - Ryzen 7 Pro 4700G
374                                                                                                      - Ryzen 7 Pro 4750GE
375                                                                                                      - Ryzen 5 Pro 4650G
376                                                                                                      - Ryzen 5 Pro 4650GE
377                                                                                                      - Ryzen 3 Pro 4350G
378                                                                                                      - Ryzen 3 Pro 4350GE
379
380     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
381                                                    - tgsplit           flat
382                                                    - xnack             scratch                       .. TODO::
383                                                                      - Packed
384                                                                        work-item                       Add product
385                                                                        IDs                             names.
386
387     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
388     -----------------------------------------------------------------------------------------------------------------------
389     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
390                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
391                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
392                                                                                                      - Radeon Pro 5600M
393     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
394                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
395                                                    - xnack             flat          - *pal-amdpal*
396                                                                        scratch
397     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
398                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
399                                                    - xnack             scratch       - *pal-amdpal*
400     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
401                                                    - wavefrontsize64   flat          - *pal-amdhsa*
402                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::
403
404                                                                                                        Add product
405                                                                                                        names.
406
407     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
408     -----------------------------------------------------------------------------------------------------------------------
409     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
410                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
411                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
412     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
413                                                    - wavefrontsize64   flat          - *pal-amdhsa*
414                                                                        scratch       - *pal-amdpal*
415     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
416                                                    - wavefrontsize64   flat          - *pal-amdhsa*
417                                                                        scratch       - *pal-amdpal*  .. TODO::
418
419                                                                                                        Add product
420                                                                                                        names.
421
422     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
423                                                    - wavefrontsize64   flat
424                                                                        scratch                       .. TODO::
425
426                                                                                                        Add product
427                                                                                                        names.
428     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
429                                                    - wavefrontsize64   flat
430                                                                        scratch                       .. TODO::
431
432                                                                                                        Add product
433                                                                                                        names.
434
435     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
436                                                    - wavefrontsize64   flat
437                                                                        scratch                       .. TODO::
438                                                                                                        Add product
439                                                                                                        names.
440
441     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
442                                                    - wavefrontsize64   flat
443                                                                        scratch                       .. TODO::
444
445                                                                                                        Add product
446                                                                                                        names.
447
448     **GCN GFX11**
449     -----------------------------------------------------------------------------------------------------------------------
450     ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected   - *pal-amdpal*  *TBA*
451                                                    - wavefrontsize64   flat
452                                                                        scratch                       .. TODO::
453                                                                      - Packed
454                                                                        work-item                       Add product
455                                                                        IDs                             names.
456
457     ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
458                                                    - wavefrontsize64   flat
459                                                                        scratch                       .. TODO::
460                                                                      - Packed
461                                                                        work-item                       Add product
462                                                                        IDs                             names.
463
464     ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
465                                                    - wavefrontsize64   flat
466                                                                        scratch                       .. TODO::
467                                                                      - Packed
468                                                                        work-item                       Add product
469                                                                        IDs                             names.
470
471     ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
472                                                    - wavefrontsize64   flat
473                                                                        scratch                       .. TODO::
474                                                                      - Packed
475                                                                        work-item                       Add product
476                                                                        IDs                             names.
477
478     =========== =============== ============ ===== ================= =============== =============== ======================
479
480.. _amdgpu-target-features:
481
482Target Features
483---------------
484
Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution or reduced performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
494
495Target features are controlled by exactly one of the following Clang
496options:
497
498``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
499
  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.
503
504``-m[no-]<target-feature>``
505
  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. If the option is
  not specified, the default is ``off``.
511
512For example:
513
514``-mcpu=gfx908:xnack+``
515  Enable the ``xnack`` feature.
516``-mcpu=gfx908:xnack-``
517  Disable the ``xnack`` feature.
518``-mcumode``
519  Enable the ``cumode`` feature.
520``-mno-cumode``
521  Disable the ``cumode`` feature.
522
523  .. table:: AMDGPU Target Features
524     :name: amdgpu-target-features-table
525
526     =============== ============================ ==================================================
527     Target Feature  Clang Option to Control      Description
528     Name
529     =============== ============================ ==================================================
530     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
531                                                  when generating code for kernels. When disabled
532                                                  native WGP wavefront execution mode is used,
533                                                  when enabled CU wavefront execution mode is used
534                                                  (see :ref:`amdgpu-amdhsa-memory-model`).
535
536     sramecc         - ``-mcpu``                  If specified, generate code that can only be
537                     - ``--offload-arch``         loaded and executed in a process that has a
538                                                  matching setting for SRAMECC.
539
540                                                  If not specified for code object V2 to V3, generate
541                                                  code that can be loaded and executed in a process
542                                                  with SRAMECC enabled.
543
544                                                  If not specified for code object V4 or above, generate
545                                                  code that can be loaded and executed in a process
546                                                  with either setting of SRAMECC.
547
548     tgsplit           ``-m[no-]tgsplit``         Enable/disable generating code that assumes
549                                                  work-groups are launched in threadgroup split mode.
550                                                  When enabled the waves of a work-group may be
551                                                  launched in different CUs.
552
553     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
554                                                  generating code for kernels. When disabled
555                                                  native wavefront size 32 is used, when enabled
556                                                  wavefront size 64 is used.
557
558     xnack           - ``-mcpu``                  If specified, generate code that can only be
559                     - ``--offload-arch``         loaded and executed in a process that has a
560                                                  matching setting for XNACK replay.
561
562                                                  If not specified for code object V2 to V3, generate
563                                                  code that can be loaded and executed in a process
564                                                  with XNACK replay enabled.
565
566                                                  If not specified for code object V4 or above, generate
567                                                  code that can be loaded and executed in a process
568                                                  with either setting of XNACK replay.
569
570                                                  XNACK replay can be used for demand paging and
571                                                  page migration. If enabled in the device, then if
572                                                  a page fault occurs the code may execute
573                                                  incorrectly unless generated with XNACK replay
574                                                  enabled, or generated for code object V4 or above without
575                                                  specifying XNACK replay. Executing code that was
576                                                  generated with XNACK replay enabled, or generated
577                                                  for code object V4 or above without specifying XNACK replay,
578                                                  on a device that does not have XNACK replay
579                                                  enabled will execute correctly but may be less
580                                                  performant than code generated for XNACK replay
581                                                  disabled.
582     =============== ============================ ==================================================
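
As a rough illustration of the ``sramecc`` and ``xnack`` rows above, the
following Python sketch models which process settings a code object can be
loaded into. The function names and the ``"on"``/``"off"``/``None`` encoding
are assumptions made for this example; they are not part of any LLVM or
runtime API.

.. code:: python

  def effective_setting(requested, code_object_version):
      """Return "on", "off", or "any" for a target-ID feature.

      requested is "on", "off", or None when the feature is not specified.
      """
      if requested is not None:
          return requested
      # Code object V2 to V3: an unspecified feature yields code for a
      # process that has the feature enabled.
      if code_object_version in (2, 3):
          return "on"
      # Code object V4 or above: the code works with either process setting.
      return "any"


  def can_load(requested, code_object_version, process_enabled):
      """Check whether the process setting matches the code object."""
      setting = effective_setting(requested, code_object_version)
      if setting == "any":
          return True
      return (setting == "on") == process_enabled

For instance, ``can_load(None, 4, False)`` is true in this model, while
``can_load(None, 3, False)`` is not.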
583
584.. _amdgpu-target-id:
585
586Target ID
587---------
588
589AMDGPU supports target IDs. See `Clang Offload Bundler
590<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
591description. The AMDGPU target specific information is:
592
593**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.
598
599**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetical order.
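
The ordering and uniqueness rules above can be sketched in Python. This is a
minimal illustration, assuming the ``<processor>:<feature>+`` /
``<processor>:<feature>-`` spelling used in the ``-mcpu=gfx908:xnack+``
examples. The function name ``canonicalize_target_id`` is hypothetical, and
the sketch does not map alternative processor names to primary names or check
that the processor supports each feature.

.. code:: python

  def canonicalize_target_id(target_id):
      """Sort the target features of a target ID alphabetically."""
      processor, *features = target_id.split(":")
      settings = {}
      for feature in features:
          name, sign = feature[:-1], feature[-1:]
          if not name or sign not in ("+", "-"):
              raise ValueError(f"malformed target feature: {feature!r}")
          # Each target feature must appear at most once.
          if name in settings:
              raise ValueError(f"target feature repeated: {name!r}")
          settings[name] = sign
      ordered = [name + settings[name] for name in sorted(settings)]
      return ":".join([processor] + ordered)

With this sketch, ``canonicalize_target_id("gfx908:xnack+:sramecc-")`` yields
``"gfx908:sramecc-:xnack+"``.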
608
609.. _amdgpu-target-id-v2-v3:
610
611Code Object V2 to V3 Target ID
612~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
613
614The target ID syntax for code object V2 to V3 is the same as defined in `Clang
615Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
616when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
617directive and the bundle entry ID. In those cases it has the following BNF
618syntax:
619
620.. code::
621
622  <target-id> ::== <processor> ( "+" <target-feature> )*
623
624Where a target feature is omitted if *Off* and present if *On* or *Any*.
625
626.. note::
627
  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.
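
The following Python sketch shows how a string in the code object V2 to V3
syntax above could be assembled. The function name and the
``"on"``/``"off"``/``"any"`` encoding are assumptions made for illustration.
The BNF does not state an ordering for the features, so emitting them in
alphabetical order is also an assumption.

.. code:: python

  def v2_v3_target_id(processor, features):
      """Build a V2/V3 target ID from a mapping of feature name to setting."""
      parts = [processor]
      for name in sorted(features):
          setting = features[name]
          if setting not in ("on", "off", "any"):
              raise ValueError(f"unknown setting for {name!r}: {setting!r}")
          # A feature is present if On or Any, and omitted if Off.
          if setting != "off":
              parts.append(name)
      return "+".join(parts)

For example, ``v2_v3_target_id("gfx908", {"xnack": "on", "sramecc": "off"})``
gives ``"gfx908+xnack"``.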
630
631.. _amdgpu-embedding-bundled-objects:
632
633Embedding Bundled Code Objects
634------------------------------
635
636AMDGPU supports the HIP and OpenMP languages that perform code object embedding
637as described in `Clang Offload Bundler
638<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
639
640.. note::
641
642  The target ID syntax used for code object V2 to V3 for a bundle entry ID
643  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
644
645.. _amdgpu-address-spaces:
646
647Address Spaces
648--------------
649
650The AMDGPU architecture supports a number of memory address spaces. The address
651space names use the OpenCL standard names, with some additions.
652
653The AMDGPU address spaces correspond to target architecture specific LLVM
654address space numbers used in LLVM IR.
655
656The AMDGPU address spaces are described in
657:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
658supported for the ``amdgcn`` target.
659
660  .. table:: AMDGPU Address Spaces
661     :name: amdgpu-address-spaces-table
662
663     ================================= =============== =========== ================ ======= ============================
664     ..                                                                                     64-Bit Process Address Space
665     --------------------------------- --------------- ----------- ---------------- ------------------------------------
666     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
667                                       Space Number    Name        Name             Size
668     ================================= =============== =========== ================ ======= ============================
669     Generic                           0               flat        flat             64      0x0000000000000000
670     Global                            1               global      global           64      0x0000000000000000
671     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
672     Local                             3               group       LDS              32      0xFFFFFFFF
673     Constant                          4               constant    *same as global* 64      0x0000000000000000
674     Private                           5               private     scratch          32      0xFFFFFFFF
675     Constant 32-bit                   6               *TODO*                               0x00000000
676     Buffer Fat Pointer (experimental) 7               *TODO*
677     ================================= =============== =========== ================ ======= ============================
678
679**Generic**
680  The generic address space is supported unless the *Target Properties* column
681  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
682  space*.
683
  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures) that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.
690
691  Flat access to scratch requires hardware aperture setup and setup in the
692  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
693  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
694  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
695
  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
  GFX9-GFX11 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32, which makes it easier to convert from flat to segment or
  segment to flat.
706
707  A global address space address has the same value when used as a flat address
708  so no conversion is needed.
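
  Because each aperture is 2^32 bytes and its base is aligned to 2^32, the
  conversions reduce to simple bit operations. The Python sketch below
  illustrates the arithmetic only. The function names are hypothetical, the
  aperture base passed in is whatever the hardware or queue reports, and
  conversion of NULL values (which differ between address spaces) is not
  handled.

  .. code:: python

    APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


    def segment_to_flat(aperture_base, segment_address):
        """Combine a 32-bit segment address with its aperture base."""
        if aperture_base % APERTURE_SIZE != 0:
            raise ValueError("aperture base must be aligned to 2^32")
        if not 0 <= segment_address < APERTURE_SIZE:
            raise ValueError("segment address must fit in 32 bits")
        return aperture_base | segment_address


    def flat_to_segment(flat_address):
        """Keep the low 32 bits of a flat address inside an aperture."""
        return flat_address & (APERTURE_SIZE - 1)


    def in_aperture(aperture_base, flat_address):
        """Check whether a flat address falls inside the given aperture."""
        return flat_address - flat_to_segment(flat_address) == aperture_base

  A flat address for which ``in_aperture`` is false for both the private and
  local aperture bases is treated as a global address and needs no conversion.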
709
710**Global and Constant**
711  The global and constant address spaces both use global virtual addresses,
712  which are the same virtual address space used by the CPU. However, some
713  virtual addresses may only be accessible to the CPU, some only accessible
714  by the GPU, and some by both.
715
  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. Because the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
  of volatile data before each kernel dispatch execution to allow constant
  memory to change values between kernel dispatches.
724
725**Region**
726  The region address space uses the hardware Global Data Store (GDS). All
727  wavefronts executing on the same device will access the same memory for any
728  given region address. However, the same region address accessed by wavefronts
729  executing on different devices will access different memory. It is higher
730  performance than global memory. It is allocated by the runtime. The data
731  store (DS) instructions can be used to access it.
732
733**Local**
734  The local address space uses the hardware Local Data Store (LDS) which is
735  automatically allocated when the hardware creates the wavefronts of a
736  work-group, and freed when all the wavefronts of a work-group have
737  terminated. All wavefronts belonging to the same work-group will access the
738  same memory for any given local address. However, the same local address
739  accessed by wavefronts belonging to different work-groups will access
740  different memory. It is higher performance than global memory. The data store
741  (DS) instructions can be used to access it.
742
743**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different to the memory accessed by another lane
  of the same or different wavefront for the same private address.
749
750  If a kernel dispatch uses scratch, then the hardware allocates memory from a
751  pool of backing memory allocated by the runtime for each wavefront. The lanes
752  of the wavefront access this using dword (4 byte) interleaving. The mapping
753  used from private address to backing memory address is:
754
755    ``wavefront-scratch-base +
756    ((private-address / 4) * wavefront-size * 4) +
757    (wavefront-lane-id * 4) + (private-address % 4)``
758
759  If each lane of a wavefront accesses the same private address, the
760  interleaving results in adjacent dwords being accessed and hence requires
761  fewer cache lines to be fetched.
762
763  There are different ways that the wavefront scratch base address is
764  determined by a wavefront (see
765  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
766
767  Scratch memory can be accessed in an interleaved manner using buffer
768  instructions with the scratch buffer descriptor and per wavefront scratch
769  offset, by the scratch instructions, or by flat instructions. Multi-dword
770  access is not supported except by flat and scratch instructions in
771  GFX9-GFX11.
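
  The private address to backing memory address mapping given earlier in this
  entry can be written out as a small Python function. This is only a worked
  form of that formula; the function and parameter names are chosen for this
  example.

  .. code:: python

    def private_to_backing_address(wavefront_scratch_base, private_address,
                                   wavefront_lane_id, wavefront_size):
        """Map a lane's private address to its backing memory address."""
        # Split the private address into a dword index and a byte offset.
        dword_index, byte_in_dword = divmod(private_address, 4)
        return (wavefront_scratch_base
                + dword_index * wavefront_size * 4  # one dword per lane
                + wavefront_lane_id * 4             # this lane's dword
                + byte_in_dword)

  For a fixed private address, lanes ``n`` and ``n + 1`` map to backing
  addresses 4 bytes apart, which is the dword interleaving described above.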
772
773**Constant 32-bit**
774  *TODO*
775
776**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is
  intended, in the future, to support the modelling of 128-bit buffer
  descriptors plus a 32-bit offset into the buffer (in total encapsulating a
  160-bit *pointer*). This would allow normal LLVM load/store/atomic operations
  to be used to model the buffer descriptors used heavily in graphics workloads
  targeting the backend.
784
785.. _amdgpu-memory-scopes:
786
787Memory Scopes
788-------------
789
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
793
794The memory model supported is based on the HSA memory model [HSA]_ which is
795based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
796relation is transitive over the synchronizes-with relation independent of scope
797and synchronizes-with allows the memory scope instances to be inclusive (see
798table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
799
This is different to the OpenCL [OpenCL]_ memory model, which does not have
scope inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.
803
804  .. table:: AMDHSA LLVM Sync Scopes
805     :name: amdgpu-amdhsa-llvm-sync-scopes-table
806
807     ======================= ===================================================
808     LLVM Sync Scope         Description
809     ======================= ===================================================
810     *none*                  The default: ``system``.
811
812                             Synchronizes with, and participates in modification
813                             and seq_cst total orderings with, other operations
814                             (except image operations) for all address spaces
815                             (except private, or generic that accesses private)
816                             provided the other operation's sync scope is:
817
818                             - ``system``.
819                             - ``agent`` and executed by a thread on the same
820                               agent.
821                             - ``workgroup`` and executed by a thread in the
822                               same work-group.
823                             - ``wavefront`` and executed by a thread in the
824                               same wavefront.
825
826     ``agent``               Synchronizes with, and participates in modification
827                             and seq_cst total orderings with, other operations
828                             (except image operations) for all address spaces
829                             (except private, or generic that accesses private)
830                             provided the other operation's sync scope is:
831
832                             - ``system`` or ``agent`` and executed by a thread
833                               on the same agent.
834                             - ``workgroup`` and executed by a thread in the
835                               same work-group.
836                             - ``wavefront`` and executed by a thread in the
837                               same wavefront.
838
839     ``workgroup``           Synchronizes with, and participates in modification
840                             and seq_cst total orderings with, other operations
841                             (except image operations) for all address spaces
842                             (except private, or generic that accesses private)
843                             provided the other operation's sync scope is:
844
845                             - ``system``, ``agent`` or ``workgroup`` and
846                               executed by a thread in the same work-group.
847                             - ``wavefront`` and executed by a thread in the
848                               same wavefront.
849
850     ``wavefront``           Synchronizes with, and participates in modification
851                             and seq_cst total orderings with, other operations
852                             (except image operations) for all address spaces
853                             (except private, or generic that accesses private)
854                             provided the other operation's sync scope is:
855
856                             - ``system``, ``agent``, ``workgroup`` or
857                               ``wavefront`` and executed by a thread in the
858                               same wavefront.
859
860     ``singlethread``        Only synchronizes with and participates in
861                             modification and seq_cst total orderings with,
862                             other operations (except image operations) running
863                             in the same thread for all address spaces (for
864                             example, in signal handlers).
865
866     ``one-as``              Same as ``system`` but only synchronizes with other
867                             operations within the same address space.
868
869     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
870                             operations within the same address space.
871
872     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
873                             other operations within the same address space.
874
875     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
876                             other operations within the same address space.
877
878     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
879                             other operations within the same address space.
880     ======================= ===================================================
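
The scope inclusion rules in the table above follow one pattern: two operations
can synchronize when both threads belong to the same instance of the narrower
of the two scopes. The Python sketch below is a simplified model of that
pattern. The ``Thread`` type and the ``may_synchronize`` function are
hypothetical names, the default scope is spelled ``system``, and the sketch
ignores the ``one-as`` variants and the exceptions for image operations and
the private address space.

.. code:: python

  from typing import NamedTuple


  class Thread(NamedTuple):
      """Hierarchical position of a thread (illustrative only)."""
      agent: int
      workgroup: int
      wavefront: int
      lane: int


  # How many leading Thread fields identify an instance of each scope.
  _SCOPE_DEPTH = {
      "system": 0,
      "agent": 1,
      "workgroup": 2,
      "wavefront": 3,
      "singlethread": 4,
  }


  def may_synchronize(scope_a, thread_a, scope_b, thread_b):
      """Check that both threads share an instance of the narrower scope."""
      depth = max(_SCOPE_DEPTH[scope_a], _SCOPE_DEPTH[scope_b])
      return thread_a[:depth] == thread_b[:depth]

For example, an ``agent`` scope operation and a ``workgroup`` scope operation
may synchronize in this model only if their threads are in the same
work-group.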
881
882LLVM IR Intrinsics
883------------------
884
885The AMDGPU backend implements the following LLVM IR intrinsics.
886
887*This section is WIP.*
888
889.. TODO::
890
891   List AMDGPU intrinsics.
892
893LLVM IR Attributes
894------------------
895
896The AMDGPU backend supports the following LLVM IR attributes.
897
898  .. table:: AMDGPU LLVM IR Attributes
899     :name: amdgpu-llvm-ir-attributes-table
900
901     ======================================= ==========================================================
902     LLVM Attribute                          Description
903     ======================================= ==========================================================
904     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
905                                             will be specified when the kernel is dispatched. Generated
906                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
907                                             The implied default value is 1,1024.
908
909     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
910                                             argument block size for the implicit arguments. This
911                                             varies by OS and language (for OpenCL see
912                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
913     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
914                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
915     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
916                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
917     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
918                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
919                                             CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
920                                             and the backend may not be able to satisfy the request. If
921                                             the specified range is incompatible with the function's
922                                             "amdgpu-flat-work-group-size" value, the occupancy bounds
923                                             implied by the workgroup size take precedence.
924
925     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
926                                             mode register to be set on entry. Overrides the default for
927                                             the calling convention.
928     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
929                                             the mode register to be set on entry. Overrides the default
930                                             for the calling convention.
931
932     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
933                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
934                                             attribute, or reached through a call site marked with this attribute,
935                                             the value returned by the intrinsic is undefined. The backend can
936                                             generally infer this during code generation, so typically there is no
937                                             benefit to frontends marking functions with this.
938
939     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
940                                             llvm.amdgcn.workitem.id.y intrinsic.
941
942     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
943                                             llvm.amdgcn.workitem.id.z intrinsic.
944
945     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
946                                             llvm.amdgcn.workgroup.id.x intrinsic.
947
948     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
949                                             llvm.amdgcn.workgroup.id.y intrinsic.
950
951     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
952                                             llvm.amdgcn.workgroup.id.z intrinsic.
953
954     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
955                                             llvm.amdgcn.dispatch.ptr intrinsic.
956
957     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
958                                             llvm.amdgcn.implicitarg.ptr intrinsic.
959
960     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
961                                             llvm.amdgcn.dispatch.id intrinsic.
962
963     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
964                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
965                                             attributes, the queue pointer may be required in situations where the
966                                             intrinsic call does not directly appear in the program. Some subtargets
967                                             require the queue pointer to handle some addrspacecasts, as well
968                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
969                                             llvm.debugtrap intrinsics.
970
971     "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
972                                             kernel argument that holds the pointer to the hostcall buffer. If this
973                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
974
975     "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
976                                             kernel argument that holds the pointer to an initialized memory buffer
977                                             that conforms to the requirements of the V1 implementation of the
978                                             malloc/free device library. If this attribute is absent, then the
979                                             amdgpu-no-implicitarg-ptr is also removed.
980
981     "amdgpu-no-multigrid-sync-arg"          Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
982                                             kernel argument that holds the multigrid synchronization pointer. If this
983                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
984     ======================================= ==========================================================
985
986.. _amdgpu-elf-code-object:
987
988ELF Code Object
989===============
990
991The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
992can be linked by ``lld`` to produce a standard ELF shared code object which can
993be loaded and executed on an AMDGPU target.
994
995.. _amdgpu-elf-header:
996
997Header
998------
999
1000The AMDGPU backend uses the following ELF header:
1001
1002  .. table:: AMDGPU ELF Header
1003     :name: amdgpu-elf-header-table
1004
1005     ========================== ===============================
1006     Field                      Value
1007     ========================== ===============================
1008     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
1009     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
1010     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
1011                                - ``ELFOSABI_AMDGPU_HSA``
1012                                - ``ELFOSABI_AMDGPU_PAL``
1013                                - ``ELFOSABI_AMDGPU_MESA3D``
1014     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1015                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
1016                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
1017                                - ``ELFABIVERSION_AMDGPU_HSA_V5``
1018                                - ``ELFABIVERSION_AMDGPU_PAL``
1019                                - ``ELFABIVERSION_AMDGPU_MESA3D``
1020     ``e_type``                 - ``ET_REL``
1021                                - ``ET_DYN``
1022     ``e_machine``              ``EM_AMDGPU``
1023     ``e_entry``                0
1024     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1025                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
1026                                and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1027     ========================== ===============================
1028
1029..
1030
1031  .. table:: AMDGPU ELF Header Enumeration Values
1032     :name: amdgpu-elf-header-enumeration-values-table
1033
1034     =============================== =====
1035     Name                            Value
1036     =============================== =====
1037     ``EM_AMDGPU``                   224
1038     ``ELFOSABI_NONE``               0
1039     ``ELFOSABI_AMDGPU_HSA``         64
1040     ``ELFOSABI_AMDGPU_PAL``         65
1041     ``ELFOSABI_AMDGPU_MESA3D``      66
1042     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1043     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1044     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1045     ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1046     ``ELFABIVERSION_AMDGPU_PAL``    0
1047     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1048     =============================== =====
1049
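As a rough illustration of how the two tables above might be applied, the
following C sketch checks whether a buffer begins with an ``amdgcn`` code object
header and maps ``e_ident[EI_OSABI]`` to an OS name. The helper names are
hypothetical and are not part of any LLVM or runtime API. The ``e_ident``
indices, the ELF magic bytes, and the ``e_machine`` offset come from the generic
ELF layout (see [ELF]_) and are spelled out here as assumptions instead of being
taken from a system header.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Values from the AMDGPU ELF header enumeration table. */
  enum {
    EM_AMDGPU = 224,
    ELFOSABI_NONE = 0,
    ELFOSABI_AMDGPU_HSA = 64,
    ELFOSABI_AMDGPU_PAL = 65,
    ELFOSABI_AMDGPU_MESA3D = 66
  };

  /* Generic ELF layout assumptions (not AMDGPU specific). */
  enum {
    EI_CLASS_INDEX = 4,
    EI_DATA_INDEX = 5,
    EI_OSABI_INDEX = 7,
    ELFCLASS64_VALUE = 2,
    ELFDATA2LSB_VALUE = 1,
    E_MACHINE_OFFSET = 18
  };

  /* Hypothetical helper: true if buf looks like an amdgcn code object header. */
  bool looks_like_amdgcn_code_object(const uint8_t *buf, size_t size) {
    if (size < E_MACHINE_OFFSET + 2)
      return false;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
      return false;
    if (buf[EI_CLASS_INDEX] != ELFCLASS64_VALUE ||
        buf[EI_DATA_INDEX] != ELFDATA2LSB_VALUE)
      return false;
    /* e_machine is a little-endian 16-bit field. */
    unsigned e_machine =
        buf[E_MACHINE_OFFSET] | (buf[E_MACHINE_OFFSET + 1] << 8);
    return e_machine == EM_AMDGPU;
  }

  /* Hypothetical helper: OS name for an EI_OSABI value, or NULL if unlisted. */
  const char *amdgpu_osabi_to_os(uint8_t osabi) {
    switch (osabi) {
    case ELFOSABI_NONE:          return "unknown";
    case ELFOSABI_AMDGPU_HSA:    return "amdhsa";
    case ELFOSABI_AMDGPU_PAL:    return "amdpal";
    case ELFOSABI_AMDGPU_MESA3D: return "mesa3d";
    default:                     return NULL;
    }
  }

The sketch only accepts ``ELFCLASS64``, so it covers the ``amdgcn`` architecture
and would reject an ``r600`` code object.
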
1050``e_ident[EI_CLASS]``
1051  The ELF class is:
1052
1053  * ``ELFCLASS32`` for ``r600`` architecture.
1054
1055  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1056    process address space applications.
1057
1058``e_ident[EI_DATA]``
1059  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1060
1061``e_ident[EI_OSABI]``
1062  One of the following AMDGPU target architecture specific OS ABIs
1063  (see :ref:`amdgpu-os`):
1064
1065  * ``ELFOSABI_NONE`` for *unknown* OS.
1066
1067  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1068
1069  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1070
1071  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1072
1073``e_ident[EI_ABIVERSION]``
1074  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1075  object conforms:
1076
1077  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1078    runtime ABI for code object V2. Specify using the Clang option
1079    ``-mcode-object-version=2``.
1080
1081  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1082    runtime ABI for code object V3. Specify using the Clang option
1083    ``-mcode-object-version=3``.
1084
1085  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1086    runtime ABI for code object V4. Specify using the Clang option
1087    ``-mcode-object-version=4``. This is the default code object
1088    version if not specified.
1089
1090  * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1091    runtime ABI for code object V5. Specify using the Clang option
1092    ``-mcode-object-version=5``.
1093
1094  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1095    runtime ABI.
1096
1097  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1098    3D runtime ABI.
1099
1100``e_type``
1101  Can be one of the following values:
1102
1103
1104  ``ET_REL``
1105    The type produced by the AMDGPU backend compiler as it is a relocatable
1106    code object.
1107
1108  ``ET_DYN``
1109    The type produced by the linker as it is a shared code object.
1110
1111  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1112
1113``e_machine``
1114  The value ``EM_AMDGPU`` is used for the machine for all processors supported
1115  by the ``r600`` and ``amdgcn`` architectures (see
1116  :ref:`amdgpu-processor-table`). The specific processor is specified in the
1117  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1118  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1119  ``e_flags`` for code object V3 and above (see
1120  :ref:`amdgpu-elf-header-e_flags-table-v3` and
1121  :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1122
1123``e_entry``
1124  The entry point is 0 as the entry points for individual kernels must be
1125  selected in order to invoke them through AQL packets.
1126
1127``e_flags``
1128  The AMDGPU backend uses the following ELF header flags:
1129
1130  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1131     :name: amdgpu-elf-header-e_flags-v2-table
1132
1133     ===================================== ===== =============================
1134     Name                                  Value Description
1135     ===================================== ===== =============================
1136     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1137                                                 target feature is
1138                                                 enabled for all code
1139                                                 contained in the code object.
1140                                                 If the processor
1141                                                 does not support the
1142                                                 ``xnack`` target
1143                                                 feature then it must
1144                                                 be 0.
1145                                                 See
1146                                                 :ref:`amdgpu-target-features`.
1147     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1148                                                 handler is enabled for all
1149                                                 code contained in the code
1150                                                 object. If the processor
1151                                                 does not support a trap
1152                                                 handler then it must be 0.
1153                                                 See
1154                                                 :ref:`amdgpu-target-features`.
1155     ===================================== ===== =============================
1156
1157  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1158     :name: amdgpu-elf-header-e_flags-table-v3
1159
1160     ================================= ===== =============================
1161     Name                              Value Description
1162     ================================= ===== =============================
1163     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1164                                             mask for
1165                                             ``EF_AMDGPU_MACH_xxx`` values
1166                                             defined in
1167                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
1168     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1169                                             target feature is
1170                                             enabled for all code
1171                                             contained in the code object.
1172                                             If the processor
1173                                             does not support the
1174                                             ``xnack`` target
1175                                             feature then it must
1176                                             be 0.
1177                                             See
1178                                             :ref:`amdgpu-target-features`.
1179     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1180                                             target feature is
1181                                             enabled for all code
1182                                             contained in the code object.
1183                                             If the processor
1184                                             does not support the
1185                                             ``sramecc`` target
1186                                             feature then it must
1187                                             be 0.
1188                                             See
1189                                             :ref:`amdgpu-target-features`.
1190     ================================= ===== =============================
1191
1192  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1193     :name: amdgpu-elf-header-e_flags-table-v4-onwards
1194
1195     ============================================ ===== ===================================
1196     Name                                         Value Description
1197     ============================================ ===== ===================================
1198     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1199                                                        mask for
1200                                                        ``EF_AMDGPU_MACH_xxx`` values
1201                                                        defined in
1202                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
1203     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1204                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1205                                                        values.
1206     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1207     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1208     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1209     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1210     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1211                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1212                                                        values.
1213     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1214     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
1215     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1216     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1217     ============================================ ===== ===================================
1218
1219  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1220     :name: amdgpu-ef-amdgpu-mach-table
1221
1222     ==================================== ========== =============================
1223     Name                                 Value      Description (see
1224                                                     :ref:`amdgpu-processor-table`)
1225     ==================================== ========== =============================
1226     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1227     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1228     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1229     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1230     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1231     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1232     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1233     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1234     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1235     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1236     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1237     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1238     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1239     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1240     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1241     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1242     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1243     *reserved*                           0x011 -    Reserved for ``r600``
1244                                          0x01f      architecture processors.
1245     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1246     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1247     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1248     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1249     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1250     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1251     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1252     *reserved*                           0x027      Reserved.
1253     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1254     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1255     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1256     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1257     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1258     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1259     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1260     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1261     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1262     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1263     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1264     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1265     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1266     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1267     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1268     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1269     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1270     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1271     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1272     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1273     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1274     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1275     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1276     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1277     ``EF_AMDGPU_MACH_AMDGCN_GFX940``     0x040      ``gfx940``
1278     ``EF_AMDGPU_MACH_AMDGCN_GFX1100``    0x041      ``gfx1100``
1279     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1280     *reserved*                           0x043      Reserved.
1281     ``EF_AMDGPU_MACH_AMDGCN_GFX1103``    0x044      ``gfx1103``
1282     ``EF_AMDGPU_MACH_AMDGCN_GFX1036``    0x045      ``gfx1036``
1283     ``EF_AMDGPU_MACH_AMDGCN_GFX1101``    0x046      ``gfx1101``
1284     ``EF_AMDGPU_MACH_AMDGCN_GFX1102``    0x047      ``gfx1102``
1285     ==================================== ========== =============================
1286
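The ``e_flags`` layout for code object V4 and after can be decoded with plain
masking. The following C sketch is one possible way to do it. The mask and value
names mirror the table above, while the three helper functions are hypothetical
and only illustrate the bit arithmetic.

.. code:: c

  #include <stdint.h>

  /* Masks and values from the e_flags table for code object V4 and after. */
  enum {
    EF_AMDGPU_MACH = 0x0ff,
    EF_AMDGPU_FEATURE_XNACK_V4 = 0x300,
    EF_AMDGPU_FEATURE_XNACK_ANY_V4 = 0x100,
    EF_AMDGPU_FEATURE_XNACK_OFF_V4 = 0x200,
    EF_AMDGPU_FEATURE_XNACK_ON_V4 = 0x300,
    EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xc00,
    EF_AMDGPU_FEATURE_SRAMECC_ANY_V4 = 0x400,
    EF_AMDGPU_FEATURE_SRAMECC_OFF_V4 = 0x800,
    EF_AMDGPU_FEATURE_SRAMECC_ON_V4 = 0xc00
  };

  /* Hypothetical helper: the EF_AMDGPU_MACH_xxx selector stored in e_flags. */
  uint32_t amdgpu_mach(uint32_t e_flags) {
    return e_flags & EF_AMDGPU_MACH;
  }

  /* Hypothetical helper: XNACK setting as text; 0x000 means unsupported. */
  const char *amdgpu_xnack_v4(uint32_t e_flags) {
    switch (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) {
    case EF_AMDGPU_FEATURE_XNACK_ANY_V4: return "any";
    case EF_AMDGPU_FEATURE_XNACK_OFF_V4: return "off";
    case EF_AMDGPU_FEATURE_XNACK_ON_V4:  return "on";
    default:                             return "unsupported";
    }
  }

  /* Hypothetical helper: SRAMECC setting as text; 0x000 means unsupported. */
  const char *amdgpu_sramecc_v4(uint32_t e_flags) {
    switch (e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) {
    case EF_AMDGPU_FEATURE_SRAMECC_ANY_V4: return "any";
    case EF_AMDGPU_FEATURE_SRAMECC_OFF_V4: return "off";
    case EF_AMDGPU_FEATURE_SRAMECC_ON_V4:  return "on";
    default:                               return "unsupported";
    }
  }

For example, an ``e_flags`` value of ``0xb3f`` splits into ``0x03f``
(``EF_AMDGPU_MACH_AMDGCN_GFX90A``), ``0x300`` (XNACK enabled), and ``0x800``
(SRAMECC disabled).
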
1287Sections
1288--------
1289
1290An AMDGPU target ELF code object has the standard ELF sections which include:
1291
1292  .. table:: AMDGPU ELF Sections
1293     :name: amdgpu-elf-sections-table
1294
1295     ================== ================ =================================
1296     Name               Type             Attributes
1297     ================== ================ =================================
1298     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1299     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1300     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1301     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
1302     ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
1303     ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
1304     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1305     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1306     ``.note``          ``SHT_NOTE``     *none*
1307     ``.rela``\ *name*  ``SHT_RELA``     *none*
1308     ``.rela.dyn``      ``SHT_RELA``     *none*
1309     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1310     ``.shstrtab``      ``SHT_STRTAB``   *none*
1311     ``.strtab``        ``SHT_STRTAB``   *none*
1312     ``.symtab``        ``SHT_SYMTAB``   *none*
1313     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1314     ================== ================ =================================
1315
1316These sections have their standard meanings (see [ELF]_) and are only generated
1317if needed.
1318
1319``.debug_``\ *\**
1320  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1321  information on the DWARF produced by the AMDGPU backend.
1322
1323``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1324  The standard sections used by a dynamic loader.
1325
1326``.note``
1327  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1328  backend.
1329
1330``.rela``\ *name*, ``.rela.dyn``
1331  For relocatable code objects, *name* is the name of the section to which the
1332  relocation records apply. For example, ``.rela.text`` is the section name for
1333  relocation records associated with the ``.text`` section.
1334
1335  For linked shared code objects, ``.rela.dyn`` contains all the relocation
1336  records from each of the relocatable code object's ``.rela``\ *name* sections.
1337
1338  See :ref:`amdgpu-relocation-records` for the relocation records supported by
1339  the AMDGPU backend.
1340
1341``.text``
1342  The executable machine code for the kernels and functions they call. Generated
1343  as position independent code. See :ref:`amdgpu-code-conventions` for
1344  information on conventions used in the ISA generation.
1345
1346.. _amdgpu-note-records:
1347
1348Note Records
1349------------
1350
1351The AMDGPU backend code object contains ELF note records in the ``.note``
1352section. The set of generated notes and their semantics depend on the code
1353object version; see :ref:`amdgpu-note-records-v2` and
1354:ref:`amdgpu-note-records-v3-onwards`.
1355
1356As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1357must be generated after the ``name`` field to ensure the ``desc`` field is 4
1358byte aligned. In addition, minimal zero-byte padding must be generated to
1359ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1360field of the ``.note`` section must be at least 4 to indicate at least 4 byte
1361alignment.
1362
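The padding rule above amounts to rounding both the ``name`` and ``desc`` sizes
up to a multiple of 4. The C sketch below shows that arithmetic. It assumes the
common ELF note layout in which each record starts with three 32-bit words
(name size, description size, and type), which is not stated in this section,
and all function names are hypothetical.

.. code:: c

  #include <stdint.h>

  /* Assumed size of the note record header: three 32-bit words. */
  enum { NOTE_HEADER_SIZE = 12 };

  /* Round n up to the next multiple of 4. */
  uint32_t note_align4(uint32_t n) {
    return (n + 3u) & ~3u;
  }

  /* Offset of the desc field from the start of the record.
     namesz counts the terminating NUL of the name. */
  uint32_t note_desc_offset(uint32_t namesz) {
    return NOTE_HEADER_SIZE + note_align4(namesz);
  }

  /* Total size of one record, including both runs of zero padding. */
  uint32_t note_record_size(uint32_t namesz, uint32_t descsz) {
    return note_desc_offset(namesz) + note_align4(descsz);
  }

With the vendor name "AMD" the name size is 4 and needs no padding, while
"AMDGPU" has a name size of 7 and is followed by one zero byte so that ``desc``
starts on a 4 byte boundary.
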
1363.. _amdgpu-note-records-v2:
1364
1365Code Object V2 Note Records
1366~~~~~~~~~~~~~~~~~~~~~~~~~~~
1367
1368.. warning::
1369  Code object V2 is not the default code object version emitted by
1370  this version of LLVM.
1371
1372The AMDGPU backend code object uses the following ELF note records in the
1373``.note`` section when compiling for code object V2.
1374
1375The note record vendor field is "AMD".
1376
1377Additional note records may be present, but any which are not documented here
1378are deprecated and should not be used.
1379
1380  .. table:: AMDGPU Code Object V2 ELF Note Records
1381     :name: amdgpu-elf-note-records-v2-table
1382
1383     ===== ===================================== ======================================
1384     Name  Type                                  Description
1385     ===== ===================================== ======================================
1386     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1387     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1388                                                 Finalizer and not the LLVM compiler.
1389     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1390     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1391                                                 YAML [YAML]_ textual format.
1392     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1393     ===== ===================================== ======================================
1394
1395..
1396
1397  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1398     :name: amdgpu-elf-note-record-enumeration-values-v2-table
1399
1400     ===================================== =====
1401     Name                                  Value
1402     ===================================== =====
1403     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1404     ``NT_AMD_HSA_HSAIL``                  2
1405     ``NT_AMD_HSA_ISA_VERSION``            3
1406     *reserved*                            4-9
1407     ``NT_AMD_HSA_METADATA``               10
1408     ``NT_AMD_HSA_ISA_NAME``               11
1409     ===================================== =====
1410
1411``NT_AMD_HSA_CODE_OBJECT_VERSION``
1412  Specifies the code object version number. The description field has the
1413  following layout:
1414
1415  .. code:: c
1416
1417    struct amdgpu_hsa_note_code_object_version_s {
1418      uint32_t major_version;
1419      uint32_t minor_version;
1420    };
1421
1422  The ``major_version`` has a value less than or equal to 2.
1423
1424``NT_AMD_HSA_HSAIL``
1425  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1426  field has the following layout:
1427
1428  .. code:: c
1429
1430    struct amdgpu_hsa_note_hsail_s {
1431      uint32_t hsail_major_version;
1432      uint32_t hsail_minor_version;
1433      uint8_t profile;
1434      uint8_t machine_model;
1435      uint8_t default_float_round;
1436    };
1437
1438``NT_AMD_HSA_ISA_VERSION``
1439  Specifies the target ISA version. The description field has the following layout:
1440
1441  .. code:: c
1442
1443    struct amdgpu_hsa_note_isa_s {
1444      uint16_t vendor_name_size;
1445      uint16_t architecture_name_size;
1446      uint32_t major;
1447      uint32_t minor;
1448      uint32_t stepping;
1449      char vendor_and_architecture_name[1];
1450    };
1451
1452  ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1453  vendor and architecture names respectively, including the NUL character.
1454
1455  ``vendor_and_architecture_name`` contains the NUL terminated string for the
1456  vendor, immediately followed by the NUL terminated string for the
1457  architecture.
1458
1459  This note record is used by the HSA runtime loader.
1460
1461  Code object V2 only supports a limited number of processors and has fixed
1462  settings for target features. See
1463  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1464  processors and the corresponding target ID. In the table the note record ISA
1465  name is a concatenation of the vendor name, architecture name, major, minor,
1466  and stepping separated by a ":".
1467
1468  The target ID column shows the processor name and fixed target features used
1469  by the LLVM compiler. The LLVM compiler does not generate a
1470  ``NT_AMD_HSA_HSAIL`` note record.
1471
1472  A code object generated by the Finalizer also uses code object V2 and always
1473  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1474  ``sramecc`` target feature are as shown in
1475  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1476  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1477  bit.
1478
1479``NT_AMD_HSA_ISA_NAME``
1480  Specifies the target ISA name as a non-NUL terminated string.
1481
1482  This note record is not used by the HSA runtime loader.
1483
1484  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1485  V2's limited support of processors and fixed settings for target features.
1486
1487  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1488  from the string to the corresponding target ID. If the ``xnack`` target
1489  feature is supported and enabled, the string produced by the LLVM compiler
1490  may have ``+xnack`` appended. The Finalizer did not do the appending and
1491  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1492
1493``NT_AMD_HSA_METADATA``
1494  Specifies extensible metadata associated with the code objects executed on HSA
1495  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1496  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1497  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1498  metadata string.
1499
1500  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1501     :name: amdgpu-elf-note-record-supported_processors-v2-table
1502
1503     ===================== ==========================
1504     Note Record ISA Name  Target ID
1505     ===================== ==========================
1506     ``AMD:AMDGPU:6:0:0``  ``gfx600``
1507     ``AMD:AMDGPU:6:0:1``  ``gfx601``
1508     ``AMD:AMDGPU:6:0:2``  ``gfx602``
1509     ``AMD:AMDGPU:7:0:0``  ``gfx700``
1510     ``AMD:AMDGPU:7:0:1``  ``gfx701``
1511     ``AMD:AMDGPU:7:0:2``  ``gfx702``
1512     ``AMD:AMDGPU:7:0:3``  ``gfx703``
1513     ``AMD:AMDGPU:7:0:4``  ``gfx704``
1514     ``AMD:AMDGPU:7:0:5``  ``gfx705``
1515     ``AMD:AMDGPU:8:0:0``  ``gfx802``
1516     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1517     ``AMD:AMDGPU:8:0:2``  ``gfx802``
1518     ``AMD:AMDGPU:8:0:3``  ``gfx803``
1519     ``AMD:AMDGPU:8:0:4``  ``gfx803``
1520     ``AMD:AMDGPU:8:0:5``  ``gfx805``
1521     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1522     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1523     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1524     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1525     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1526     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1527     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1528     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1529     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1530     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1531     ===================== ==========================
1532
1533.. _amdgpu-note-records-v3-onwards:
1534
1535Code Object V3 and Above Note Records
1536~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1537
1538The AMDGPU backend code object uses the following ELF note record in the
1539``.note`` section when compiling for code object V3 and above.
1540
1541The note record vendor field is "AMDGPU".
1542
1543Additional note records may be present, but any which are not documented here
1544are deprecated and should not be used.
1545
1546  .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1547     :name: amdgpu-elf-note-records-table-v3-onwards
1548
1549     ======== ============================== ======================================
1550     Name     Type                           Description
1551     ======== ============================== ======================================
1552     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1553                                             binary format.
1554     ======== ============================== ======================================
1555
1556..
1557
1558  .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1559     :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1560
1561     ============================== =====
1562     Name                           Value
1563     ============================== =====
1564     *reserved*                     0-31
1565     ``NT_AMDGPU_METADATA``         32
1566     ============================== =====
1567
1568``NT_AMDGPU_METADATA``
1569  Specifies extensible metadata associated with an AMDGPU code object. It is
1570  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1571  :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1572  :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
1573  :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
1574  ``amdhsa`` OS.
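
As an illustration only, a tool could locate the ``NT_AMDGPU_METADATA`` note by
walking the standard ELF note layout (``namesz``, ``descsz`` and ``type`` words,
followed by the name and the descriptor, each padded to a 4-byte boundary). The
Python sketch below is not part of LLVM: the function names are hypothetical,
little-endian fields are assumed, and the MsgPack payload is returned undecoded.

.. code:: python

  import struct

  NT_AMDGPU_METADATA = 32  # value from the enumeration table above

  def align4(n):
      """Round n up to the next multiple of 4 (ELF note field padding)."""
      return (n + 3) & ~3

  def find_amdgpu_metadata(note_section):
      """Return the raw MsgPack bytes of the NT_AMDGPU_METADATA note, or None.

      note_section holds the bytes of a .note section (assumed little-endian).
      """
      offset = 0
      while offset + 12 <= len(note_section):
          namesz, descsz, note_type = struct.unpack_from(
              "<III", note_section, offset)
          offset += 12
          # The stored name includes its NUL terminator; drop it to compare.
          name = note_section[offset:offset + namesz].rstrip(b"\0")
          offset += align4(namesz)
          desc = note_section[offset:offset + descsz]
          offset += align4(descsz)
          if name == b"AMDGPU" and note_type == NT_AMDGPU_METADATA:
              return desc
      return None

The returned bytes could then be handed to any MsgPack decoder to obtain the
metadata map.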
1575
1576.. _amdgpu-symbols:
1577
1578Symbols
1579-------
1580
1581Symbols include the following:
1582
1583  .. table:: AMDGPU ELF Symbols
1584     :name: amdgpu-elf-symbols-table
1585
1586     ===================== ================== ================ ==================
1587     Name                  Type               Section          Description
1588     ===================== ================== ================ ==================
1589     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1590                                              - ``.rodata``
1591                                              - ``.bss``
1592     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1593     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1594     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1595     ===================== ================== ================ ==================
1596
1597Global variable
1598  Global variables both used and defined by the compilation unit.
1599
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
1602
  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1604  will resolve relocations using the definition provided by another code object
1605  or explicitly defined by the runtime.
1606
1607  If the symbol resides in local/group memory (LDS) then its section is the
1608  special processor specific section name ``SHN_AMDGPU_LDS``, and the
1609  ``st_value`` field describes alignment requirements as it does for common
1610  symbols.
1611
1612  .. TODO::
1613
1614     Add description of linked shared object symbols. Seems undefined symbols
1615     are marked as STT_NOTYPE.
1616
1617Kernel descriptor
1618  Every HSA kernel has an associated kernel descriptor. It is the address of the
1619  kernel descriptor that is used in the AQL dispatch packet used to invoke the
1620  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1621  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1622
1623Kernel entry point
1624  Every HSA kernel also has a symbol for its machine code entry point.
1625
1626.. _amdgpu-relocation-records:
1627
1628Relocation Records
1629------------------
1630
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1632relocatable fields are:
1633
1634``word32``
1635  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1636  alignment. These values use the same byte order as other word values in the
1637  AMDGPU architecture.
1638
1639``word64``
1640  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1641  alignment. These values use the same byte order as other word values in the
1642  AMDGPU architecture.
1643
The following notations are used for specifying relocation calculations:
1645
1646**A**
1647  Represents the addend used to compute the value of the relocatable field.
1648
1649**G**
1650  Represents the offset into the global offset table at which the relocation
1651  entry's symbol will reside during execution.
1652
1653**GOT**
1654  Represents the address of the global offset table.
1655
1656**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1658  of the storage unit being relocated (computed using ``r_offset``).
1659
1660**S**
1661  Represents the value of the symbol whose index resides in the relocation
1662  entry. Relocations not using this must specify a symbol index of
1663  ``STN_UNDEF``.
1664
1665**B**
1666  Represents the base address of a loaded executable or shared object which is
1667  the difference between the ELF address and the actual load address.
1668  Relocations using this are only valid in executable or shared objects.
1669
1670The following relocation types are supported:
1671
1672  .. table:: AMDGPU ELF Relocation Records
1673     :name: amdgpu-elf-relocation-records-table
1674
1675     ========================== ======= =====  ==========  ==============================
1676     Relocation Type            Kind    Value  Field       Calculation
1677     ========================== ======= =====  ==========  ==============================
1678     ``R_AMDGPU_NONE``                  0      *none*      *none*
1679     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1680                                Dynamic
1681     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1682                                Dynamic
1683     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1684                                Dynamic
1685     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1686     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1687     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1688                                Dynamic
1689     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1690     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1691     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1692     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1693     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1694     *reserved*                         12
1695     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1696     ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1697     ========================== ======= =====  ==========  ==============================
1698
1699``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1700the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1701
1702There is no current OS loader support for 32-bit programs and so
1703``R_AMDGPU_ABS32`` is not used.
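
To make the calculations in the table concrete, the following Python sketch
evaluates a few of them. It is illustrative only and not taken from any loader:
the function names are hypothetical, values are assumed to wrap as 64-bit two's
complement integers, and ``rel16`` assumes the byte delta is a multiple of 4.

.. code:: python

  MASK16 = 0xFFFF
  MASK32 = 0xFFFFFFFF
  MASK64 = 0xFFFFFFFFFFFFFFFF

  def rel32_lo(s, a, p):
      """R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF."""
      return ((s + a - p) & MASK64) & MASK32

  def rel32_hi(s, a, p):
      """R_AMDGPU_REL32_HI: (S + A - P) >> 32, on the 64-bit wrapped value."""
      return ((s + a - p) & MASK64) >> 32

  def abs64(s, a):
      """R_AMDGPU_ABS64: S + A."""
      return (s + a) & MASK64

  def relative64(b, a):
      """R_AMDGPU_RELATIVE64: B + A."""
      return (b + a) & MASK64

  def rel16(s, a, p):
      """R_AMDGPU_REL16: ((S + A - P) - 4) / 4, truncated to 16 bits."""
      return (((s + a - p) - 4) // 4) & MASK16

The ``_LO`` and ``_HI`` pair together carry the full 64-bit result, so
``(rel32_hi(s, a, p) << 32) | rel32_lo(s, a, p)`` reconstructs ``S + A - P``.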
1704
1705.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1706
1707Loaded Code Object Path Uniform Resource Identifier (URI)
1708---------------------------------------------------------
1709
1710The AMD GPU code object loader represents the path of the ELF shared object from
1711which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded and relocated form of the
ELF shared object.  Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.
1715
1716The loaded code object path URI syntax is defined by the following BNF syntax:
1717
1718.. code::
1719
1720  code_object_uri ::== file_uri | memory_uri
1721  file_uri        ::== "file://" file_path [ range_specifier ]
1722  memory_uri      ::== "memory://" process_id range_specifier
1723  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1724  file_path       ::== URI_ENCODED_OS_FILE_PATH
1725  process_id      ::== DECIMAL_NUMBER
1726  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1727
1728**number**
1729  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1730  and octal values by "0".
1731
1732**file_path**
1733  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1734  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
1736  the path are separated by "/".
1737
1738**offset**
1739  Is a 0-based byte offset to the start of the code object.  For a file URI, it
1740  is from the start of the file specified by the ``file_path``, and if omitted
1741  defaults to 0. For a memory URI, it is the memory address and is required.
1742
1743**size**
1744  Is the number of bytes in the code object.  For a file URI, if omitted it
1745  defaults to the size of the file.  It is required for a memory URI.
1746
1747**process_id**
1748  Is the identity of the process owning the memory.  For Linux it is the C
1749  unsigned integral decimal literal for the process ID (PID).
1750
1751For example:
1752
1753.. code::
1754
1755  file:///dir1/dir2/file1
1756  file:///dir3/dir4/file2#offset=0x2000&size=3000
1757  memory://1234#offset=0x20000&size=3000
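
A consumer of these URIs might parse them along the following lines. This Python
sketch is only one reading of the grammar above, not the loader's
implementation: the function names and the returned dictionary layout are
hypothetical, and it assumes that a range specifier always starts with either
"#" or "?".

.. code:: python

  import re
  from urllib.parse import unquote

  _NUMBER = r"0[xX][0-9a-fA-F]+|[0-9]+"
  _RANGE = rf"[#?]offset=(?P<offset>{_NUMBER})&size=(?P<size>{_NUMBER})"
  _FILE_URI = re.compile(rf"file://(?P<path>[^#?]*)(?:{_RANGE})?")
  _MEMORY_URI = re.compile(rf"memory://(?P<pid>[0-9]+){_RANGE}")

  def parse_number(text):
      """Parse a C integral literal: 0x/0X hex, leading 0 octal, else decimal."""
      if text[:2] in ("0x", "0X"):
          return int(text[2:], 16)
      if len(text) > 1 and text[0] == "0":
          return int(text, 8)
      return int(text, 10)

  def parse_code_object_uri(uri):
      """Return a dict describing a loaded code object path URI."""
      match = _FILE_URI.fullmatch(uri)
      if match:
          # file_path is URI encoded, so decode the %XX escapes.
          result = {"kind": "file", "path": unquote(match["path"])}
      else:
          match = _MEMORY_URI.fullmatch(uri)
          if not match:
              raise ValueError(f"not a code object URI: {uri!r}")
          result = {"kind": "memory", "process_id": int(match["pid"])}
      # The range is optional for file URIs and required for memory URIs.
      if match["offset"] is not None:
          result["offset"] = parse_number(match["offset"])
          result["size"] = parse_number(match["size"])
      return result

For a file URI without a range specifier, the caller would apply the defaults
described above (offset 0 and the size of the file).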
1758
1759.. _amdgpu-dwarf-debug-information:
1760
1761DWARF Debug Information
1762=======================
1763
1764.. warning::
1765
1766   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1767   is not currently fully implemented and is subject to change.
1768
1769AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1770:ref:`amdgpu-elf-code-object`) which contain information that maps the code
1771object executable code and data to the source language constructs. It can be
1772used by tools such as debuggers and profilers. It uses features defined in
1773:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1774DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1775
1776This section defines the AMDGPU target architecture specific DWARF mappings.
1777
1778.. _amdgpu-dwarf-register-identifier:
1779
1780Register Identifier
1781-------------------
1782
1783This section defines the AMDGPU target architecture register numbers used in
1784DWARF operation expressions (see DWARF Version 5 section 2.5 and
1785:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1786instructions (see DWARF Version 5 section 6.4 and
1787:ref:`amdgpu-dwarf-call-frame-information`).
1788
1789A single code object can contain code for kernels that have different wavefront
1790sizes. The vector registers and some scalar registers are based on the wavefront
1791size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1792simplifies the consumer of the DWARF so that each register has a fixed size,
1793rather than being dynamic according to the wavefront size mode. Similarly,
1794distinct DWARF registers are defined for those registers that vary in size
1795according to the process address size. This allows a consumer to treat a
1796specific AMDGPU processor as a single architecture regardless of how it is
1797configured at run time. The compiler explicitly specifies the DWARF registers
1798that match the mode in which the code it is generating will be executed.
1799
1800DWARF registers are encoded as numbers, which are mapped to architecture
1801registers. The mapping for AMDGPU is defined in
1802:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1803mapping.
1804
1805.. table:: AMDGPU DWARF Register Mapping
1806   :name: amdgpu-dwarf-register-mapping-table
1807
1808   ============== ================= ======== ==================================
1809   DWARF Register AMDGPU Register   Bit Size Description
1810   ============== ================= ======== ==================================
1811   0              PC_32             32       Program Counter (PC) when
1812                                             executing in a 32-bit process
1813                                             address space. Used in the CFI to
1814                                             describe the PC of the calling
1815                                             frame.
1816   1              EXEC_MASK_32      32       Execution Mask Register when
1817                                             executing in wavefront 32 mode.
1818   2-15           *Reserved*                 *Reserved for highly accessed
1819                                             registers using DWARF shortcut.*
1820   16             PC_64             64       Program Counter (PC) when
1821                                             executing in a 64-bit process
1822                                             address space. Used in the CFI to
1823                                             describe the PC of the calling
1824                                             frame.
1825   17             EXEC_MASK_64      64       Execution Mask Register when
1826                                             executing in wavefront 64 mode.
1827   18-31          *Reserved*                 *Reserved for highly accessed
1828                                             registers using DWARF shortcut.*
1829   32-95          SGPR0-SGPR63      32       Scalar General Purpose
1830                                             Registers.
1831   96-127         *Reserved*                 *Reserved for frequently accessed
1832                                             registers using DWARF 1-byte ULEB.*
1833   128            STATUS            32       Status Register.
1834   129-511        *Reserved*                 *Reserved for future Scalar
1835                                             Architectural Registers.*
1836   512            VCC_32            32       Vector Condition Code Register
1837                                             when executing in wavefront 32
1838                                             mode.
1839   513-767        *Reserved*                 *Reserved for future Vector
1840                                             Architectural Registers when
1841                                             executing in wavefront 32 mode.*
1842   768            VCC_64            64       Vector Condition Code Register
1843                                             when executing in wavefront 64
1844                                             mode.
1845   769-1023       *Reserved*                 *Reserved for future Vector
1846                                             Architectural Registers when
1847                                             executing in wavefront 64 mode.*
1848   1024-1087      *Reserved*                 *Reserved for padding.*
1849   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1850   1130-1535      *Reserved*                 *Reserved for future Scalar
1851                                             General Purpose Registers.*
1852   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1853                                             when executing in wavefront 32
1854                                             mode.
1855   1792-2047      *Reserved*                 *Reserved for future Vector
1856                                             General Purpose Registers when
1857                                             executing in wavefront 32 mode.*
1858   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1859                                             when executing in wavefront 32
1860                                             mode.
1861   2304-2559      *Reserved*                 *Reserved for future Vector
1862                                             Accumulation Registers when
1863                                             executing in wavefront 32 mode.*
1864   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1865                                             when executing in wavefront 64
1866                                             mode.
1867   2816-3071      *Reserved*                 *Reserved for future Vector
1868                                             General Purpose Registers when
1869                                             executing in wavefront 64 mode.*
1870   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1871                                             when executing in wavefront 64
1872                                             mode.
1873   3328-3583      *Reserved*                 *Reserved for future Vector
1874                                             Accumulation Registers when
1875                                             executing in wavefront 64 mode.*
1876   ============== ================= ======== ==================================
1877
1878The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
1880the least significant bit position corresponding to lane 0 and so forth. DWARF
1881location expressions involving the ``DW_OP_LLVM_offset`` and
1882``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1883register corresponding to the lane that is executing the current thread of
1884execution in languages that are implemented using a SIMD or SIMT execution
1885model.
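
As a rough illustration of this layout, the Python sketch below picks the dword
of one lane out of the raw bytes of a vector register. The function name is
hypothetical, and the register is assumed to be held as a byte string with the
least significant byte first.

.. code:: python

  def lane_dword(register_bytes, lane):
      """Return the 32-bit value a vector register holds for one lane.

      register_bytes is the whole register, least significant byte first:
      128 bytes in wavefront 32 mode, 256 bytes in wavefront 64 mode.
      """
      start = lane * 4  # lane 0 occupies the least significant dword
      return int.from_bytes(register_bytes[start:start + 4], "little")

In DWARF the same selection is expressed with ``DW_OP_LLVM_push_lane`` and
``DW_OP_LLVM_offset`` rather than computed by the consumer directly.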
1886
1887If the wavefront size is 32 lanes then the wavefront 32 mode register
1888definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1889mode register definitions are used. Some AMDGPU targets support executing in
1890both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1891to the wavefront mode of the generated code will be used.
1892
1893If code is generated to execute in a 32-bit process address space, then the
189432-bit process address space register definitions are used. If code is generated
1895to execute in a 64-bit process address space, then the 64-bit process address
1896space register definitions are used. The ``amdgcn`` target only supports the
189764-bit process address space.
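
The ranges in :ref:`amdgpu-dwarf-register-mapping-table` can be turned into a
small lookup. The Python sketch below is illustrative only; the function names
and the ``"vgpr"``/``"agpr"`` keys are hypothetical, and the base numbers are
copied from the table.

.. code:: python

  def sgpr_dwarf_register(n):
      """Return the DWARF register number for SGPRn."""
      if 0 <= n <= 63:
          return 32 + n           # SGPR0-SGPR63 map to 32-95
      if 64 <= n <= 105:
          return 1088 + (n - 64)  # SGPR64-SGPR105 map to 1088-1129
      raise ValueError(f"no DWARF register for SGPR{n}")

  # First DWARF register of each vector register bank, per wavefront size.
  _VECTOR_BASE = {
      ("vgpr", 32): 1536,
      ("agpr", 32): 2048,
      ("vgpr", 64): 2560,
      ("agpr", 64): 3072,
  }

  def vector_dwarf_register(kind, n, wavefront_size):
      """Return the DWARF register number for VGPRn or AGPRn."""
      if not 0 <= n <= 255:
          raise ValueError(f"register index {n} out of range")
      return _VECTOR_BASE[(kind, wavefront_size)] + n

For example, ``vector_dwarf_register("vgpr", 0, 64)`` selects the wavefront 64
definition of VGPR0, which is a different DWARF register from the wavefront 32
definition of the same hardware register.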
1898
1899.. _amdgpu-dwarf-address-class-identifier:
1900
1901Address Class Identifier
1902------------------------
1903
1904The DWARF address class represents the source language memory space. See DWARF
1905Version 5 section 2.12 which is updated by the *DWARF Extensions For
1906Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1907
1908The DWARF address class mapping used for AMDGPU is defined in
1909:ref:`amdgpu-dwarf-address-class-mapping-table`.
1910
1911.. table:: AMDGPU DWARF Address Class Mapping
1912   :name: amdgpu-dwarf-address-class-mapping-table
1913
1914   ========================= ====== =================
1915   DWARF                            AMDGPU
1916   -------------------------------- -----------------
1917   Address Class Name        Value  Address Space
1918   ========================= ====== =================
1919   ``DW_ADDR_none``          0x0000 Generic (Flat)
1920   ``DW_ADDR_LLVM_global``   0x0001 Global
1921   ``DW_ADDR_LLVM_constant`` 0x0002 Global
1922   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
1923   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
1924   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
1925   ========================= ====== =================
1926
1927The DWARF address class values defined in the *DWARF Extensions For
1928Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
1929
In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
available for the AMD extension that provides access to the hardware GDS
memory, which is scratchpad memory allocated per device.
1933
1934For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1935address class of ``DW_ADDR_none`` is used.
1936
1937See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1938mapping of DWARF address classes to DWARF address spaces, including address size
1939and NULL value.
1940
1941.. _amdgpu-dwarf-address-space-identifier:
1942
1943Address Space Identifier
1944------------------------
1945
1946DWARF address spaces correspond to target architecture specific linear
1947addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1948For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1949
1950The DWARF address space mapping used for AMDGPU is defined in
1951:ref:`amdgpu-dwarf-address-space-mapping-table`.
1952
1953.. table:: AMDGPU DWARF Address Space Mapping
1954   :name: amdgpu-dwarf-address-space-mapping-table
1955
1956   ======================================= ===== ======= ======== ================= =======================
1957   DWARF                                                          AMDGPU            Notes
1958   --------------------------------------- ----- ---------------- ----------------- -----------------------
1959   Address Space Name                      Value Address Bit Size Address Space
1960   --------------------------------------- ----- ------- -------- ----------------- -----------------------
1961   ..                                            64-bit  32-bit
1962                                                 process process
1963                                                 address address
1964                                                 space   space
1965   ======================================= ===== ======= ======== ================= =======================
1966   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
1967   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
1968   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
1969   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
1970   *Reserved*                              0x04
1971   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
1972   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
1973   ======================================= ===== ======= ======== ================= =======================
1974
1975See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1976including address size and NULL value.
1977
1978The ``DW_ASPACE_none`` address space is the default target architecture address
1979space used in DWARF operations that do not specify an address space. It
1980therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1981related operations can refer to addresses in the program code.
1982
1983The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1984specify the flat address space. If the address corresponds to an address in the
1985local address space, then it corresponds to the wavefront that is executing the
1986focused thread of execution. If the address corresponds to an address in the
1987private address space, then it corresponds to the lane that is executing the
1988focused thread of execution for languages that are implemented using a SIMD or
1989SIMT execution model.
1990
1991.. note::
1992
1993  CUDA-like languages such as HIP that do not have address spaces in the
1994  language type system, but do allow variables to be allocated in different
1995  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1996  address space in the DWARF expression operations as the default address space
1997  is the global address space.
1998
1999The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2000specify the local address space corresponding to the wavefront that is executing
2001the focused thread of execution.
2002
2003The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2004to specify the private address space corresponding to the lane that is executing
2005the focused thread of execution for languages that are implemented using a SIMD
2006or SIMT execution model.
2007
2008The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2009to specify the unswizzled private address space corresponding to the wavefront
2010that is executing the focused thread of execution. The wavefront view of private
2011memory is the per wavefront unswizzled backing memory layout defined in
2012:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2013location for the backing memory of the wavefront (namely the address is not
2014offset by ``wavefront-scratch-base``). The following formula can be used to
2015convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2016``DW_ASPACE_AMDGPU_private_wave`` address:
2017
2018::
2019
2020  private-address-wavefront =
2021    ((private-address-lane / 4) * wavefront-size * 4) +
2022    (wavefront-lane-id * 4) + (private-address-lane % 4)
2023
2024If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
2025of the dwords for each lane starting with lane 0 is required, then this
2026simplifies to:
2027
2028::
2029
2030  private-address-wavefront =
2031    private-address-lane * wavefront-size
2032
2033A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2034complete spilled vector register back into a complete vector register in the
2035CFI. The frame pointer can be a private lane address which is dword aligned,
2036which can be shifted to multiply by the wavefront size, and then used to form a
2037private wavefront address that gives a location for a contiguous set of dwords,
2038one per lane, where the vector register dwords are spilled. The compiler knows
2039the wavefront size since it generates the code. Note that the type of the
2040address may have to be converted as the size of a
2041``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2042``DW_ASPACE_AMDGPU_private_wave`` address.
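
The conversion formula and its dword-aligned simplification can be checked with
a few lines of Python. This is a sketch for illustration; the function name is
hypothetical and addresses are treated as plain non-negative integers.

.. code:: python

  def private_lane_to_wave(lane_address, lane_id, wavefront_size):
      """Convert a private lane address to a private wavefront address."""
      dword_index, byte_in_dword = divmod(lane_address, 4)
      return (dword_index * wavefront_size * 4) + (lane_id * 4) + byte_in_dword

  # For a dword-aligned address and lane 0 the result is simply
  # lane_address * wavefront_size, which a compiler can form with a shift
  # (by 6 for a 64 lane wavefront, by 5 for a 32 lane wavefront).
  assert private_lane_to_wave(16, 0, 64) == 16 << 6
  assert private_lane_to_wave(16, 0, 32) == 16 << 5

Consecutive lanes of the same lane address land 4 bytes apart, which is what
makes the spilled dwords of a vector register contiguous in the wavefront view.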
2043
2044.. _amdgpu-dwarf-lane-identifier:
2045
Lane Identifier
2047---------------
2048
DWARF lane identifiers specify a target architecture lane position for hardware
2050that executes in a SIMD or SIMT manner, and on which a source language maps its
2051threads of execution onto those lanes. The DWARF lane identifier is pushed by
2052the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2053section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2054section :ref:`amdgpu-dwarf-operation-expressions`.
2055
2056For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2057wavefront. It is numbered from 0 to the wavefront size minus 1.
2058
2059Operation Expressions
2060---------------------
2061
2062DWARF expressions are used to compute program values and the locations of
2063program objects. See DWARF Version 5 section 2.5 and
2064:ref:`amdgpu-dwarf-operation-expressions`.
2065
2066DWARF location descriptions describe how to access storage which includes memory
2067and registers. When accessing storage on AMDGPU, bytes are ordered with least
2068significant bytes first, and bits are ordered within bytes with least
2069significant bits first.
2070
2071For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2072unwinding vector registers that are spilled under the execution mask to memory:
2073the zero-single location description is the vector register, and the one-single
2074location description is the spilled memory location description. The
2075``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2076memory location description.
2077
2078In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2079``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2080controlled by the execution mask. An undefined location description together
2081with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2082to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2083
2084Debugger Information Entry Attributes
2085-------------------------------------
2086
2087This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1, which are updated
by the *DWARF Extensions For Heterogeneous Debugging* sections
2090:ref:`amdgpu-dwarf-low-level-information` and
2091:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2092
2093.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2094
2095``DW_AT_LLVM_lane_pc``
2096~~~~~~~~~~~~~~~~~~~~~~
2097
2098For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2099location of the separate lanes of a SIMT thread.
2100
2101If the lane is an active lane then this will be the same as the current program
2102location.
2103
2104If the lane is inactive, but was active on entry to the subprogram, then this is
2105the program location in the subprogram at which execution of the lane is
conceptually positioned.
2107
2108If the lane was not active on entry to the subprogram, then this will be the
2109undefined location. A client debugger can check if the lane is part of a valid
2110work-group by checking that the lane is in the range of the associated
2111work-group within the grid, accounting for partial work-groups. If it is not,
2112then the debugger can omit any information for the lane. Otherwise, the debugger
2113may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually, the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.
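
The debugger behaviour described above might be sketched as follows. The data
model is hypothetical and exists only for illustration: each call frame is
represented by a list of per-lane program locations, innermost frame first,
with ``None`` standing for the undefined location.

.. code:: python

  def lane_location(frames, lane, lane_in_work_group):
      """Return (frame_depth, pc) for a lane, or None if there is nothing to show.

      frames: per-frame lists of lane PCs, innermost frame first; None means
      the DW_AT_LLVM_lane_pc of that frame is undefined for the lane.
      """
      if not lane_in_work_group:
          return None  # lane is outside the (possibly partial) work-group
      for depth, lane_pcs in enumerate(frames):
          pc = lane_pcs[lane]
          if pc is not None:
              return depth, pc  # first frame in which the lane was active
      return None

A lane that was inactive on entry to the innermost subprogram is thus reported
at the location recorded by the nearest calling frame in which it was active.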
2117
2118The following example illustrates how the AMDGPU backend can generate a DWARF
2119location list expression for the nested ``IF/THEN/ELSE`` structures of the
2120following subprogram pseudo code for a target with 64 lanes per wavefront.
2121
2122.. code::
2123  :number-lines:
2124
2125  SUBPROGRAM X
2126  BEGIN
2127    a;
2128    IF (c1) THEN
2129      b;
2130      IF (c2) THEN
2131        c;
2132      ELSE
2133        d;
2134      ENDIF
2135      e;
2136    ELSE
2137      f;
2138    ENDIF
2139    g;
2140  END
2141
2142The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2143execution mask (``EXEC``) to linearize the control flow. The condition is
2144evaluated to make a mask of the lanes for which the condition evaluates to true.
2145First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE``
2149region the ``EXEC`` mask is restored to the value it had at the beginning of the
2150region. This is shown below. Other approaches are possible, but the basic
2151concept is the same.
2152
2153.. code::
2154  :number-lines:
2155
2156  $lex_start:
2157    a;
2158    %1 = EXEC
2159    %2 = c1
2160  $lex_1_start:
2161    EXEC = %1 & %2
  $lex_1_then:
2163      b;
2164      %3 = EXEC
2165      %4 = c2
2166  $lex_1_1_start:
2167      EXEC = %3 & %4
2168  $lex_1_1_then:
2169        c;
2170      EXEC = ~EXEC & %3
2171  $lex_1_1_else:
2172        d;
2173      EXEC = %3
2174  $lex_1_1_end:
2175      e;
2176    EXEC = ~EXEC & %1
2177  $lex_1_else:
2178      f;
2179    EXEC = %1
2180  $lex_1_end:
2181    g;
2182  $lex_end:
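
The mask manipulation can be traced with a small Python simulation. It is a
sketch under assumptions: masks are plain integers with one bit per lane, the
function name is made up, and the conditions ``c1`` and ``c2`` are supplied as
precomputed masks rather than evaluated by the program.

```python
MASK64 = (1 << 64) - 1  # 64 lanes per wavefront, as in the example


def linearized_x(exec_mask: int, c1: int, c2: int) -> dict:
    """Return, for each statement a..g, the mask of lanes that execute it."""
    ran = {}
    ran["a"] = exec_mask
    s1 = exec_mask                        # %1 = EXEC
    exec_mask = s1 & c1                   # EXEC = %1 & %2
    ran["b"] = exec_mask
    s3 = exec_mask                        # %3 = EXEC
    exec_mask = s3 & c2                   # EXEC = %3 & %4
    ran["c"] = exec_mask
    exec_mask = ~exec_mask & s3 & MASK64  # EXEC = ~EXEC & %3
    ran["d"] = exec_mask
    exec_mask = s3                        # EXEC = %3
    ran["e"] = exec_mask
    exec_mask = ~exec_mask & s1 & MASK64  # EXEC = ~EXEC & %1
    ran["f"] = exec_mask
    exec_mask = s1                        # EXEC = %1
    ran["g"] = exec_mask
    return ran


ran = linearized_x(exec_mask=0b1111, c1=0b0011, c2=0b0101)
for name, mask in ran.items():
    print(name, format(mask, "04b"))
```

With four active lanes, lanes 0 and 1 take the outer ``THEN`` path, lane 0
also takes the inner ``THEN`` path, and lanes 2 and 3 take the outer ``ELSE``
path.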
2183
2184To create the DWARF location list expression that defines the location
2185description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2186pseudo instruction can be used to annotate the linearized control flow. This can
2187be done by defining an artificial variable for the lane PC. The DWARF location
2188list expression created for it is used as the value of the
2189``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2190
A DWARF procedure is defined for each well-nested structured control flow
region. It provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
2194conceptually inherits the value of the immediately enclosing region and modifies
2195it according to the semantics of the region.
2196
2197For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2198the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2199region the divergent program location is at the end of the ``IF/THEN/ELSE``
2200region since the ``THEN`` region has completed.
2201
2202The lane PC artificial variable is assigned at each region transition. It uses
2203the immediately enclosing region's DWARF procedure to compute the program
2204location for each lane assuming they are divergent, and then modifies the result
2205by inserting the current program location for each lane that the ``EXEC`` mask
2206indicates is active.
2207
2208By having separate DWARF procedures for each region, they can be reused to
2209define the value for any nested region. This reduces the total size of the DWARF
2210operation expressions.
2211
2212The following provides an example using pseudo LLVM MIR.
2213
2214.. code::
2215  :number-lines:
2216
2217  $lex_start:
2218    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2219      DW_AT_name = "__uint64";
2220      DW_AT_byte_size = 8;
2221      DW_AT_encoding = DW_ATE_unsigned;
2222    ];
2223    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2224      DW_AT_name = "__active_lane_pc";
2225      DW_AT_location = [
2226        DW_OP_regx PC;
2227        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
2229        DW_OP_LLVM_select_bit_piece 64, 64;
2230      ];
2231    ];
2232    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2233      DW_AT_name = "__divergent_lane_pc";
2234      DW_AT_location = [
2235        DW_OP_LLVM_undefined;
2236        DW_OP_LLVM_extend 64, 64;
2237      ];
2238    ];
2239    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2240      DW_OP_call_ref %__divergent_lane_pc;
2241      DW_OP_call_ref %__active_lane_pc;
2242    ];
2243    a;
2244    %1 = EXEC;
2245    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2246    %2 = c1;
2247  $lex_1_start:
2248    EXEC = %1 & %2;
2249  $lex_1_then:
2250      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2251        DW_AT_name = "__divergent_lane_pc_1_then";
2252        DW_AT_location = DIExpression[
2253          DW_OP_call_ref %__divergent_lane_pc;
2254          DW_OP_addrx &lex_1_start;
2255          DW_OP_stack_value;
2256          DW_OP_LLVM_extend 64, 64;
2257          DW_OP_call_ref %__lex_1_save_exec;
2258          DW_OP_deref_type 64, %__uint_64;
2259          DW_OP_LLVM_select_bit_piece 64, 64;
2260        ];
2261      ];
2262      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2263        DW_OP_call_ref %__divergent_lane_pc_1_then;
2264        DW_OP_call_ref %__active_lane_pc;
2265      ];
2266      b;
2267      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2269      %4 = c2;
2270  $lex_1_1_start:
2271      EXEC = %3 & %4;
2272  $lex_1_1_then:
2273        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2274          DW_AT_name = "__divergent_lane_pc_1_1_then";
2275          DW_AT_location = DIExpression[
2276            DW_OP_call_ref %__divergent_lane_pc_1_then;
2277            DW_OP_addrx &lex_1_1_start;
2278            DW_OP_stack_value;
2279            DW_OP_LLVM_extend 64, 64;
2280            DW_OP_call_ref %__lex_1_1_save_exec;
2281            DW_OP_deref_type 64, %__uint_64;
2282            DW_OP_LLVM_select_bit_piece 64, 64;
2283          ];
2284        ];
2285        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2286          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2287          DW_OP_call_ref %__active_lane_pc;
2288        ];
2289        c;
2290      EXEC = ~EXEC & %3;
2291  $lex_1_1_else:
2292        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2293          DW_AT_name = "__divergent_lane_pc_1_1_else";
2294          DW_AT_location = DIExpression[
2295            DW_OP_call_ref %__divergent_lane_pc_1_then;
2296            DW_OP_addrx &lex_1_1_end;
2297            DW_OP_stack_value;
2298            DW_OP_LLVM_extend 64, 64;
2299            DW_OP_call_ref %__lex_1_1_save_exec;
2300            DW_OP_deref_type 64, %__uint_64;
2301            DW_OP_LLVM_select_bit_piece 64, 64;
2302          ];
2303        ];
2304        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2305          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2306          DW_OP_call_ref %__active_lane_pc;
2307        ];
2308        d;
2309      EXEC = %3;
2310  $lex_1_1_end:
2311      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
2313        DW_OP_call_ref %__active_lane_pc;
2314      ];
2315      e;
2316    EXEC = ~EXEC & %1;
2317  $lex_1_else:
2318      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2319        DW_AT_name = "__divergent_lane_pc_1_else";
2320        DW_AT_location = DIExpression[
2321          DW_OP_call_ref %__divergent_lane_pc;
2322          DW_OP_addrx &lex_1_end;
2323          DW_OP_stack_value;
2324          DW_OP_LLVM_extend 64, 64;
2325          DW_OP_call_ref %__lex_1_save_exec;
2326          DW_OP_deref_type 64, %__uint_64;
2327          DW_OP_LLVM_select_bit_piece 64, 64;
2328        ];
2329      ];
2330      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2331        DW_OP_call_ref %__divergent_lane_pc_1_else;
2332        DW_OP_call_ref %__active_lane_pc;
2333      ];
2334      f;
2335    EXEC = %1;
2336  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2338      DW_OP_call_ref %__divergent_lane_pc;
2339      DW_OP_call_ref %__active_lane_pc;
2340    ];
2341    g;
2342  $lex_end:
2343
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.
2346
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
2352
2353The DWARF procedures for each region use the values of the saved execution mask
2354artificial variables to only update the lanes that are active on entry to the
2355region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
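
The layering of the procedures can be modeled in Python. This is a conceptual
sketch only: it mirrors the intent described in the text rather than the exact
stack semantics of ``DW_OP_LLVM_select_bit_piece``, it uses an 8-lane
wavefront for readability, and the helper names are assumptions chosen to echo
the DWARF procedure names.

```python
from typing import List, Optional

LANES = 8  # small wavefront for readability (assumption; the example uses 64)

LanePCs = List[Optional[int]]  # None models the undefined location


def select_lanes(base: LanePCs, value: int, mask: int) -> LanePCs:
    """Per-lane select: take value where the mask bit is set, else keep base."""
    return [value if (mask >> i) & 1 else base[i] for i in range(LANES)]


def divergent_lane_pc() -> LanePCs:
    """Subprogram level: every lane starts with the undefined location."""
    return [None] * LANES


def divergent_lane_pc_1_then(lex_1_start: int, lex_1_save_exec: int) -> LanePCs:
    """Lanes active on entry to region 1 are positioned at its start."""
    return select_lanes(divergent_lane_pc(), lex_1_start, lex_1_save_exec)


def active_lane_pc(divergent: LanePCs, pc: int, exec_mask: int) -> LanePCs:
    """Lanes enabled in EXEC are at the current program location."""
    return select_lanes(divergent, pc, exec_mask)


# Lanes 0-3 were active on entry to region 1; only lanes 0-1 took THEN.
lane_pcs = active_lane_pc(
    divergent_lane_pc_1_then(lex_1_start=0x10, lex_1_save_exec=0b00001111),
    pc=0x40,
    exec_mask=0b00000011,
)
print(lane_pcs)
```

Lanes 0 and 1 report the current program location, lanes 2 and 3 report the
start of the region, and lanes 4 to 7 stay undefined.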
2358
2359Other structured control flow regions can be handled similarly. For example,
2360loops would set the divergent program location for the region at the end of the
2361loop. Any lanes active will be in the loop, and any lanes not active must have
2362exited the loop.
2363
2364An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2365``IF/THEN/ELSE`` regions.
2366
2367The DWARF procedures can use the active lane artificial variable described in
2368:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2369``EXEC`` mask in order to support whole or quad wavefront mode.
2370
2371.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2372
2373``DW_AT_LLVM_active_lane``
2374~~~~~~~~~~~~~~~~~~~~~~~~~~
2375
2376The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2377entry is used to specify the lanes that are conceptually active for a SIMT
2378thread.
2379
2380The execution mask may be modified to implement whole or quad wavefront mode
2381operations. For example, all lanes may need to temporarily be made active to
2382execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2383update it to enable the necessary lanes, perform the operations, and then
2384restore the ``EXEC`` mask from the saved value. While executing the whole
2385wavefront region, the conceptual execution mask is the saved value, not the
2386``EXEC`` value.
2387
2388This is handled by defining an artificial variable for the active lane mask. The
2389active lane mask artificial variable would be the actual ``EXEC`` mask for
2390normal regions, and the saved execution mask for regions where the mask is
2391temporarily updated. The location list expression created for this artificial
2392variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2393attribute.
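
The choice the artificial variable encodes can be written as a one-line
selection. This Python sketch is illustrative only; the function name and the
use of ``None`` to mean "not inside a whole wavefront region" are assumptions.

```python
from typing import Optional


def active_lane_mask(exec_mask: int, saved_exec: Optional[int]) -> int:
    """Return the conceptual active lane mask.

    saved_exec is None in normal regions, and holds the EXEC value saved on
    entry while a whole or quad wavefront region has temporarily changed EXEC.
    """
    return exec_mask if saved_exec is None else saved_exec


# Normal region: the conceptual mask is EXEC itself.
print(format(active_lane_mask(0b0110, None), "04b"))
# Whole wavefront region: EXEC is all ones, but the saved mask is reported.
print(format(active_lane_mask(0b1111, 0b0110), "04b"))
```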
2394
2395``DW_AT_LLVM_augmentation``
2396~~~~~~~~~~~~~~~~~~~~~~~~~~~
2397
2398For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2399debugger information entry has the following value for the augmentation string:
2400
2401::
2402
2403  [amdgpu:v0.0]
2404
2405The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2406extensions used in the DWARF of the compilation unit. The version number
2407conforms to [SEMVER]_.
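
A consumer might extract the version with a small parser such as the Python
sketch below. It handles only the bracketed ``[vendor:vX.Y]`` form shown
above; accepting several bracketed entries in one string, and the function
name, are assumptions of this example.

```python
import re

# Matches one "[vendor:vX.Y]" entry as shown in this section.
_ENTRY = re.compile(r"\[(?P<vendor>[^:\]]+):v(?P<major>\d+)\.(?P<minor>\d+)\]")


def parse_augmentation(text: str):
    """Return a list of (vendor, major, minor) tuples found in the string."""
    return [
        (m["vendor"], int(m["major"]), int(m["minor"]))
        for m in _ENTRY.finditer(text)
    ]


print(parse_augmentation("[amdgpu:v0.0]"))
```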
2408
2409Call Frame Information
2410----------------------
2411
2412DWARF Call Frame Information (CFI) describes how a consumer can virtually
2413*unwind* call frames in a running process or core dump. See DWARF Version 5
2414section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2415
2416For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2417
24181.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2419
2420    ::
2421
2422      [amd:v0.0]
2423
    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or in the FDEs that use it. The version number
    conforms to [SEMVER]_.
2427
24282.  ``address_size`` for the ``Global`` address space is defined in
2429    :ref:`amdgpu-dwarf-address-space-identifier`.
2430
24313.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2432
24334.  ``code_alignment_factor`` is 4 bytes.
2434
2435    .. TODO::
2436
2437       Add to :ref:`amdgpu-processor-table` table.
2438
24395.  ``data_alignment_factor`` is 4 bytes.
2440
2441    .. TODO::
2442
2443       Add to :ref:`amdgpu-processor-table` table.
2444
24456.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2446    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2447
7.  ``initial_instructions``: Since a subprogram X with fewer registers can
    be called from subprogram Y that has more allocated, X will not change any
    of the extra registers as it cannot access them. Therefore, the default
    rule for all columns is ``same value``.
2452
2453For AMDGPU the register number follows the numbering defined in
2454:ref:`amdgpu-dwarf-register-identifier`.
2455
2456For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2457the return address to get the address of a byte within the call site
2458instructions. See DWARF Version 5 section 6.4.4.
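
The reason for subtracting 1 shows up when a call is the last instruction of
an address range. The Python sketch below uses a made-up table of half-open
``(start, end)`` ranges and a hypothetical lookup helper; it is not how any
particular consumer stores its unwind tables.

```python
import bisect


def find_range(ranges, address):
    """Return the (start, end) range containing address, or None.

    ranges is a sorted list of non-overlapping half-open address ranges.
    """
    starts = [start for start, _ in ranges]
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0 and address < ranges[i][1]:
        return ranges[i]
    return None


ranges = [(0x100, 0x140), (0x140, 0x180)]
return_address = 0x140  # the call was the last instruction of the first range

print(find_range(ranges, return_address))      # finds the following range
print(find_range(ranges, return_address - 1))  # finds the range of the call
```

Because instructions are variable size, the adjusted address is only
guaranteed to land on some byte of the call instruction, not on its first
byte.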
2459
2460Accelerated Access
2461------------------
2462
2463See DWARF Version 5 section 6.1.
2464
2465Lookup By Name Section Header
2466~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2467
2468See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2469
For AMDGPU the lookup by name section header table has the following values:
2471
2472``augmentation_string_size`` (uword)
2473
2474  Set to the length of the ``augmentation_string`` value which is always a
2475  multiple of 4.
2476
2477``augmentation_string`` (sequence of UTF-8 characters)
2478
2479  Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2480
2481  ::
2482
2483    [amdgpu:v0.0]
2484
2485  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2486  extensions used in the DWARF of this index. The version number conforms to
2487  [SEMVER]_.
2488
2489  .. note::
2490
2491    This is different to the DWARF Version 5 definition that requires the first
2492    4 characters to be the vendor ID. But this is consistent with the other
2493    augmentation strings and does allow multiple vendor contributions. However,
2494    backwards compatibility may be more desirable.
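
The padding rule for the two fields can be sketched in Python as follows. The
function name is an assumption; the sketch only shows the null padding to a
multiple of 4 bytes and the resulting size value.

```python
def encode_augmentation(text: str):
    """Return (augmentation_string_size, augmentation_string bytes)."""
    data = text.encode("utf-8")
    data += b"\0" * (-len(data) % 4)  # null pad to a multiple of 4 bytes
    return len(data), data


size, data = encode_augmentation("[amdgpu:v0.0]")
print(size, data)
```

The 13-byte string is padded with three null bytes, giving a size of 16.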
2495
2496Lookup By Address Section Header
2497~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2498
2499See DWARF Version 5 section 6.1.2.
2500
For AMDGPU the lookup by address section header table has the following values:
2502
2503``address_size`` (ubyte)
2504
2505  Match the address size for the ``Global`` address space defined in
2506  :ref:`amdgpu-dwarf-address-space-identifier`.
2507
2508``segment_selector_size`` (ubyte)
2509
2510  AMDGPU does not use a segment selector so this is 0. The entries in the
2511  ``.debug_aranges`` do not have a segment selector.
2512
2513Line Number Information
2514-----------------------
2515
2516See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2517
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2519The instruction set must be obtained from the ELF file header ``e_flags`` field
2520in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2521<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
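
A consumer could read the field directly from the ELF header, as in the Python
sketch below. It assumes a little-endian ELF64 file, where ``e_flags`` is the
32-bit word at byte offset 48, and takes ``EF_AMDGPU_MACH`` to be the
``0x0ff`` mask from the ELF header description. The function name and the
sample header bytes are made up for the example.

```python
import struct

EF_AMDGPU_MACH = 0x0FF  # processor selection mask within e_flags
E_FLAGS_OFFSET = 48     # offset of e_flags in an ELF64 header


def amdgpu_mach(elf_header: bytes) -> int:
    """Return the EF_AMDGPU_MACH bits from a little-endian ELF64 header."""
    if elf_header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if elf_header[4] != 2 or elf_header[5] != 1:
        raise ValueError("expected ELFCLASS64 and ELFDATA2LSB")
    (e_flags,) = struct.unpack_from("<I", elf_header, E_FLAGS_OFFSET)
    return e_flags & EF_AMDGPU_MACH


# Made-up header: valid e_ident, zeroed fields, and e_flags = 0x32c.
header = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9) + bytes(32)
header += struct.pack("<I", 0x32C)
print(hex(amdgpu_mach(header)))
```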
2522
2523.. TODO::
2524
2525  Should the ``isa`` state machine register be used to indicate if the code is
2526  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2527
2528For AMDGPU the line number program header fields have the following values (see
2529DWARF Version 5 section 6.2.4):
2530
2531``address_size`` (ubyte)
2532  Matches the address size for the ``Global`` address space defined in
2533  :ref:`amdgpu-dwarf-address-space-identifier`.
2534
2535``segment_selector_size`` (ubyte)
2536  AMDGPU does not use a segment selector so this is 0.
2537
2538``minimum_instruction_length`` (ubyte)
2539  For GFX9-GFX11 this is 4.
2540
2541``maximum_operations_per_instruction`` (ubyte)
2542  For GFX9-GFX11 this is 1.
2543
2544Source text for online-compiled programs (for example, those compiled by the
2545OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2546See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2547Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2548<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2549
2550The Clang option used to control source embedding in AMDGPU is defined in
2551:ref:`amdgpu-clang-debug-options-table`.
2552
2553  .. table:: AMDGPU Clang Debug Options
2554     :name: amdgpu-clang-debug-options-table
2555
2556     ==================== ==================================================
2557     Debug Flag           Description
2558     ==================== ==================================================
2559     -g[no-]embed-source  Enable/disable embedding source text in DWARF
2560                          debug sections. Useful for environments where
2561                          source cannot be written to disk, such as
2562                          when performing online compilation.
2563     ==================== ==================================================
2564
2565For example:
2566
2567``-gembed-source``
2568  Enable the embedded source.
2569
2570``-gno-embed-source``
2571  Disable the embedded source.
2572
257332-Bit and 64-Bit DWARF Formats
2574-------------------------------
2575
2576See DWARF Version 5 section 7.4 and
2577:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2578
2579For AMDGPU:
2580
2581* For the ``amdgcn`` target architecture only the 64-bit process address space
2582  is supported.
2583
2584* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2585  the 32-bit DWARF format.
2586
2587Unit Headers
2588------------
2589
2590For AMDGPU the following values apply for each of the unit headers described in
2591DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2592
2593``address_size`` (ubyte)
2594  Matches the address size for the ``Global`` address space defined in
2595  :ref:`amdgpu-dwarf-address-space-identifier`.
2596
2597.. _amdgpu-code-conventions:
2598
2599Code Conventions
2600================
2601
2602This section provides code conventions used for each supported target triple OS
2603(see :ref:`amdgpu-target-triples`).
2604
2605AMDHSA
2606------
2607
2608This section provides code conventions used when the target triple OS is
2609``amdhsa`` (see :ref:`amdgpu-target-triples`).
2610
2611.. _amdgpu-amdhsa-code-object-metadata:
2612
2613Code Object Metadata
2614~~~~~~~~~~~~~~~~~~~~
2615
2616The code object metadata specifies extensible metadata associated with the code
2617objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
2619:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2620:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2621:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2622:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2623
2624Code object metadata is specified in a note record (see
2625:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries, such
as the segment sizes needed in a dispatch packet. In addition, a high-level
language runtime may require other information to be included. For example, the
AMD OpenCL runtime records kernel argument information.
2631
2632.. _amdgpu-amdhsa-code-object-metadata-v2:
2633
2634Code Object V2 Metadata
2635+++++++++++++++++++++++
2636
2637.. warning::
2638  Code object V2 is not the default code object version emitted by this version
2639  of LLVM.
2640
2641Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2642(see :ref:`amdgpu-note-records-v2`).
2643
2644The metadata is specified as a YAML formatted string (see [YAML]_ and
2645:doc:`YamlIO`).
2646
2647.. TODO::
2648
2649  Is the string null terminated? It probably should not if YAML allows it to
2650  contain null characters, otherwise it should be.
2651
2652The metadata is represented as a single YAML document comprised of the mapping
2653defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2654referenced tables.
2655
2656For boolean values, the string values of ``false`` and ``true`` are used for
2657false and true respectively.
2658
2659Additional information can be added to the mappings. To avoid conflicts, any
2660non-AMD key names should be prefixed by "*vendor-name*.".
2661
2662  .. table:: AMDHSA Code Object V2 Metadata Map
2663     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2664
2665     ========== ============== ========= =======================================
2666     String Key Value Type     Required? Description
2667     ========== ============== ========= =======================================
2668     "Version"  sequence of    Required  - The first integer is the major
2669                2 integers                 version. Currently 1.
2670                                         - The second integer is the minor
2671                                           version. Currently 0.
2672     "Printf"   sequence of              Each string is encoded information
2673                strings                  about a printf function call. The
2674                                         encoded information is organized as
2675                                         fields separated by colon (':'):
2676
2677                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2678
2679                                         where:
2680
2681                                         ``ID``
2682                                           A 32-bit integer as a unique id for
2683                                           each printf function call
2684
2685                                         ``N``
2686                                           A 32-bit integer equal to the number
2687                                           of arguments of printf function call
2688                                           minus 1
2689
2690                                         ``S[i]`` (where i = 0, 1, ... , N-1)
2691                                           32-bit integers for the size in bytes
2692                                           of the i-th FormatString argument of
2693                                           the printf function call
2694
                                         ``FormatString``
2696                                           The format string passed to the
2697                                           printf function call.
2698     "Kernels"  sequence of    Required  Sequence of the mappings for each
2699                mapping                  kernel in the code object. See
2700                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2701                                         for the definition of the mapping.
2702     ========== ============== ========= =======================================
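
The ``"Printf"`` string layout can be decoded as in the Python sketch below.
The class and function names are assumptions. The sketch relies on ``N`` to
decide how many size fields to split off, so that colons inside the format
string are preserved.

```python
from typing import List, NamedTuple


class PrintfInfo(NamedTuple):
    id: int
    arg_sizes: List[int]
    format_string: str


def parse_printf_entry(entry: str) -> PrintfInfo:
    """Decode "ID:N:S[0]:...:S[N-1]:FormatString"."""
    ident, count, rest = entry.split(":", 2)
    n = int(count)
    # Split off exactly N size fields; whatever remains is the format string.
    fields = rest.split(":", n)
    if len(fields) != n + 1:
        raise ValueError("expected %d size fields and a format string" % n)
    return PrintfInfo(int(ident), [int(s) for s in fields[:n]], fields[n])


print(parse_printf_entry("1:2:4:8:x=%d: y=%ld\\n"))
```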
2703
2704..
2705
2706  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2707     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2708
2709     ================= ============== ========= ================================
2710     String Key        Value Type     Required? Description
2711     ================= ============== ========= ================================
2712     "Name"            string         Required  Source name of the kernel.
2713     "SymbolName"      string         Required  Name of the kernel
2714                                                descriptor ELF symbol.
2715     "Language"        string                   Source language of the kernel.
2716                                                Values include:
2717
2718                                                - "OpenCL C"
2719                                                - "OpenCL C++"
2720                                                - "HCC"
2721                                                - "OpenMP"
2722
2723     "LanguageVersion" sequence of              - The first integer is the major
2724                       2 integers                 version.
2725                                                - The second integer is the
2726                                                  minor version.
2727     "Attrs"           mapping                  Mapping of kernel attributes.
2728                                                See
2729                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2730                                                for the mapping definition.
2731     "Args"            sequence of              Sequence of mappings of the
2732                       mapping                  kernel arguments. See
2733                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2734                                                for the definition of the mapping.
2735     "CodeProps"       mapping                  Mapping of properties related to
2736                                                the kernel code. See
2737                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2738                                                for the mapping definition.
2739     ================= ============== ========= ================================
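
A check for the keys marked Required in the two tables above might look like
the Python sketch below. It assumes the YAML document has already been loaded
into plain dictionaries and lists; the function name and the sample kernel
names are hypothetical.

```python
def missing_required_keys(metadata: dict) -> list:
    """Return the required V2 metadata keys that are absent."""
    missing = [key for key in ("Version", "Kernels") if key not in metadata]
    for i, kernel in enumerate(metadata.get("Kernels", [])):
        for key in ("Name", "SymbolName"):
            if key not in kernel:
                missing.append("Kernels[%d].%s" % (i, key))
    return missing


metadata = {
    "Version": [1, 0],
    "Kernels": [
        {"Name": "vec_add", "SymbolName": "vec_add_symbol"},  # hypothetical
        {"Name": "vec_mul"},                                  # no SymbolName
    ],
}
print(missing_required_keys(metadata))
```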
2740
2741..
2742
2743  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2744     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2745
2746     =================== ============== ========= ==============================
2747     String Key          Value Type     Required? Description
2748     =================== ============== ========= ==============================
2749     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2750                         3 integers               must be >=1 and the dispatch
2751                                                  work-group size X, Y, Z must
2752                                                  correspond to the specified
2753                                                  values. Defaults to 0, 0, 0.
2754
2755                                                  Corresponds to the OpenCL
2756                                                  ``reqd_work_group_size``
2757                                                  attribute.
2758     "WorkGroupSizeHint" sequence of              The dispatch work-group size
2759                         3 integers               X, Y, Z is likely to be the
2760                                                  specified values.
2761
2762                                                  Corresponds to the OpenCL
2763                                                  ``work_group_size_hint``
2764                                                  attribute.
2765     "VecTypeHint"       string                   The name of a scalar or vector
2766                                                  type.
2767
2768                                                  Corresponds to the OpenCL
2769                                                  ``vec_type_hint`` attribute.
2770
2771     "RuntimeHandle"     string                   The external symbol name
2772                                                  associated with a kernel.
2773                                                  OpenCL runtime allocates a
2774                                                  global buffer for the symbol
2775                                                  and saves the kernel's address
2776                                                  to it, which is used for
2777                                                  device side enqueueing. Only
2778                                                  available for device side
2779                                                  enqueued kernels.
2780     =================== ============== ========= ==============================
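
The ``"ReqdWorkGroupSize"`` rule can be expressed as a small predicate. This
Python sketch is illustrative; the function name is an assumption and it is
not a description of how any runtime performs the check.

```python
def dispatch_size_allowed(reqd, dispatch) -> bool:
    """Check a dispatch work-group size X, Y, Z against ReqdWorkGroupSize."""
    reqd = tuple(reqd)
    if reqd == (0, 0, 0):  # the default: no required size
        return True
    if any(value < 1 for value in reqd):
        raise ValueError("ReqdWorkGroupSize values must all be >= 1")
    return tuple(dispatch) == reqd


print(dispatch_size_allowed([0, 0, 0], [64, 1, 1]))
print(dispatch_size_allowed([256, 1, 1], [64, 1, 1]))
```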
2781
2782..
2783
2784  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2785     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2786
2787     ================= ============== ========= ================================
2788     String Key        Value Type     Required? Description
2789     ================= ============== ========= ================================
2790     "Name"            string                   Kernel argument name.
2791     "TypeName"        string                   Kernel argument type name.
2792     "Size"            integer        Required  Kernel argument size in bytes.
2793     "Align"           integer        Required  Kernel argument alignment in
2794                                                bytes. Must be a power of two.
2795     "ValueKind"       string         Required  Kernel argument kind that
2796                                                specifies how to set up the
2797                                                corresponding argument.
2798                                                Values include:
2799
2800                                                "ByValue"
2801                                                  The argument is copied
2802                                                  directly into the kernarg.
2803
2804                                                "GlobalBuffer"
2805                                                  A global address space pointer
2806                                                  to the buffer data is passed
2807                                                  in the kernarg.
2808
2809                                                "DynamicSharedPointer"
2810                                                  A group address space pointer
2811                                                  to dynamically allocated LDS
2812                                                  is passed in the kernarg.
2813
2814                                                "Sampler"
2815                                                  A global address space
2816                                                  pointer to a S# is passed in
2817                                                  the kernarg.
2818
2819                                                "Image"
2820                                                  A global address space
2821                                                  pointer to a T# is passed in
2822                                                  the kernarg.
2823
2824                                                "Pipe"
2825                                                  A global address space pointer
2826                                                  to an OpenCL pipe is passed in
2827                                                  the kernarg.
2828
2829                                                "Queue"
2830                                                  A global address space pointer
2831                                                  to an OpenCL device enqueue
2832                                                  queue is passed in the
2833                                                  kernarg.
2834
2835                                                "HiddenGlobalOffsetX"
2836                                                  The OpenCL grid dispatch
2837                                                  global offset for the X
2838                                                  dimension is passed in the
2839                                                  kernarg.
2840
2841                                                "HiddenGlobalOffsetY"
2842                                                  The OpenCL grid dispatch
2843                                                  global offset for the Y
2844                                                  dimension is passed in the
2845                                                  kernarg.
2846
2847                                                "HiddenGlobalOffsetZ"
2848                                                  The OpenCL grid dispatch
2849                                                  global offset for the Z
2850                                                  dimension is passed in the
2851                                                  kernarg.
2852
2853                                                "HiddenNone"
2854                                                  An argument that is not used
2855                                                  by the kernel. Space needs to
2856                                                  be left for it, but it does
2857                                                  not need to be set up.
2858
2859                                                "HiddenPrintfBuffer"
2860                                                  A global address space pointer
2861                                                  to the runtime printf buffer
2862                                                  is passed in the kernarg. Mutually
2863                                                  exclusive with
2864                                                  "HiddenHostcallBuffer".
2865
2866                                                "HiddenHostcallBuffer"
2867                                                  A global address space pointer
2868                                                  to the runtime hostcall buffer
2869                                                  is passed in the kernarg. Mutually
2870                                                  exclusive with
2871                                                  "HiddenPrintfBuffer".
2872
2873                                                "HiddenDefaultQueue"
2874                                                  A global address space pointer
2875                                                  to the OpenCL device enqueue
2876                                                  queue that should be used by
2877                                                  the kernel by default is
2878                                                  passed in the kernarg.
2879
2880                                                "HiddenCompletionAction"
2881                                                  A global address space pointer
2882                                                  to help link enqueued kernels into
2883                                                  the ancestor tree for determining
2884                                                  when the parent kernel has finished.
2885
2886                                                "HiddenMultiGridSyncArg"
2887                                                  A global address space pointer for
2888                                                  multi-grid synchronization is
2889                                                  passed in the kernarg.
2890
2891     "ValueType"       string                   Unused and deprecated. This should no longer
2892                                                be emitted, but is accepted for compatibility.
2893
2894
2895     "PointeeAlign"    integer                  Alignment in bytes of pointee
2896                                                type for pointer type kernel
2897                                                argument. Must be a power
2898                                                of 2. Only present if
2899                                                "ValueKind" is
2900                                                "DynamicSharedPointer".
2901     "AddrSpaceQual"   string                   Kernel argument address space
2902                                                qualifier. Only present if
2903                                                "ValueKind" is "GlobalBuffer" or
2904                                                "DynamicSharedPointer". Values
2905                                                are:
2906
2907                                                - "Private"
2908                                                - "Global"
2909                                                - "Constant"
2910                                                - "Local"
2911                                                - "Generic"
2912                                                - "Region"
2913
2914                                                .. TODO::
2915
2916                                                   Is GlobalBuffer only Global
2917                                                   or Constant? Is
2918                                                   DynamicSharedPointer always
2919                                                   Local? Can HCC allow Generic?
2920                                                   How can Private or Region
2921                                                   ever happen?
2922
2923     "AccQual"         string                   Kernel argument access
2924                                                qualifier. Only present if
2925                                                "ValueKind" is "Image" or
2926                                                "Pipe". Values
2927                                                are:
2928
2929                                                - "ReadOnly"
2930                                                - "WriteOnly"
2931                                                - "ReadWrite"
2932
2933                                                .. TODO::
2934
2935                                                   Does this apply to
2936                                                   GlobalBuffer?
2937
2938     "ActualAccQual"   string                   The actual memory accesses
2939                                                performed by the kernel on the
2940                                                kernel argument. Only present if
2941                                                "ValueKind" is "GlobalBuffer",
2942                                                "Image", or "Pipe". This may be
2943                                                more restrictive than indicated
2944                                                by "AccQual" to reflect what the
2945                                                kernel actually does. If not
2946                                                present then the runtime must
2947                                                assume what is implied by
2948                                                "AccQual" and "IsConst". Values
2949                                                are:
2950
2951                                                - "ReadOnly"
2952                                                - "WriteOnly"
2953                                                - "ReadWrite"
2954
2955     "IsConst"         boolean                  Indicates if the kernel argument
2956                                                is const qualified. Only present
2957                                                if "ValueKind" is
2958                                                "GlobalBuffer".
2959
2960     "IsRestrict"      boolean                  Indicates if the kernel argument
2961                                                is restrict qualified. Only
2962                                                present if "ValueKind" is
2963                                                "GlobalBuffer".
2964
2965     "IsVolatile"      boolean                  Indicates if the kernel argument
2966                                                is volatile qualified. Only
2967                                                present if "ValueKind" is
2968                                                "GlobalBuffer".
2969
2970     "IsPipe"          boolean                  Indicates if the kernel argument
2971                                                is pipe qualified. Only present
2972                                                if "ValueKind" is "Pipe".
2973
2974                                                .. TODO::
2975
2976                                                   Can GlobalBuffer be pipe
2977                                                   qualified?
2978
2979     ================= ============== ========= ================================
2980
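The "Only present if" notes in the table above can be read as a lookup from
each optional key to the ``"ValueKind"`` values that allow it. The following is
a minimal sketch of that reading, not part of any LLVM or runtime API: the
``ALLOWED_VALUE_KINDS`` table, the ``misplaced_keys`` helper, and the two sample
argument maps are hypothetical, and the sample maps omit the other keys of the
argument metadata.

.. code-block:: python

  # Optional keys and the "ValueKind" values for which the table allows them.
  ALLOWED_VALUE_KINDS = {
      "PointeeAlign": {"DynamicSharedPointer"},
      "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
      "AccQual": {"Image", "Pipe"},
      "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
      "IsConst": {"GlobalBuffer"},
      "IsRestrict": {"GlobalBuffer"},
      "IsVolatile": {"GlobalBuffer"},
      "IsPipe": {"Pipe"},
  }


  def misplaced_keys(arg):
      """Return the keys of one argument map that its "ValueKind" does not allow."""
      kind = arg.get("ValueKind")
      return sorted(
          key
          for key, kinds in ALLOWED_VALUE_KINDS.items()
          if key in arg and kind not in kinds
      )


  # Hypothetical argument maps, for illustration only.
  buffer_arg = {"ValueKind": "GlobalBuffer", "AddrSpaceQual": "Global", "IsConst": True}
  image_arg = {"ValueKind": "Image", "AccQual": "ReadOnly", "PointeeAlign": 4}

  print(misplaced_keys(buffer_arg))  # []
  print(misplaced_keys(image_arg))   # ['PointeeAlign']

The second map is flagged because ``"PointeeAlign"`` is only described for
``"DynamicSharedPointer"`` arguments.
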
2981..
2982
2983  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2984     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2985
2986     ============================ ============== ========= =====================
2987     String Key                   Value Type     Required? Description
2988     ============================ ============== ========= =====================
2989     "KernargSegmentSize"         integer        Required  The size in bytes of
2990                                                           the kernarg segment
2991                                                           that holds the values
2992                                                           of the arguments to
2993                                                           the kernel.
2994     "GroupSegmentFixedSize"      integer        Required  The amount of group
2995                                                           segment memory
2996                                                           required by a
2997                                                           work-group in
2998                                                           bytes. This does not
2999                                                           include any
3000                                                           dynamically allocated
3001                                                           group segment memory
3002                                                           that may be added
3003                                                           when the kernel is
3004                                                           dispatched.
3005     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
3006                                                           private address space
3007                                                           memory required for a
3008                                                           work-item in
3009                                                           bytes. If the kernel
3010                                                           uses a dynamic call
3011                                                           stack then additional
3012                                                           space must be added
3013                                                           to this value for the
3014                                                           call stack.
3015     "KernargSegmentAlign"        integer        Required  The maximum byte
3016                                                           alignment of
3017                                                           arguments in the
3018                                                           kernarg segment. Must
3019                                                           be a power of 2.
3020     "WavefrontSize"              integer        Required  Wavefront size. Must
3021                                                           be a power of 2.
3022     "NumSGPRs"                   integer        Required  Number of scalar
3023                                                           registers used by a
3024                                                           wavefront for
3025                                                           GFX6-GFX11. This
3026                                                           includes the special
3027                                                           SGPRs for VCC, Flat
3028                                                           Scratch (GFX7-GFX10)
3029                                                           and XNACK (for
3030                                                           GFX8-GFX10). It does
3031                                                           not include the 16
3032                                                           SGPRs added if a trap
3033                                                           handler is
3034                                                           enabled. It is not
3035                                                           rounded up to the
3036                                                           allocation
3037                                                           granularity.
3038     "NumVGPRs"                   integer        Required  Number of vector
3039                                                           registers used by
3040                                                           each work-item for
3041                                                           GFX6-GFX11.
3042     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
3043                                                           work-group size
3044                                                           supported by the
3045                                                           kernel in work-items.
3046                                                           Must be >=1 and
3047                                                           consistent with
3048                                                           ReqdWorkGroupSize if
3049                                                           not 0, 0, 0.
3050     "NumSpilledSGPRs"            integer                  Number of stores from
3051                                                           a scalar register to
3052                                                           a register allocator
3053                                                           created spill
3054                                                           location.
3055     "NumSpilledVGPRs"            integer                  Number of stores from
3056                                                           a vector register to
3057                                                           a register allocator
3058                                                           created spill
3059                                                           location.
3060     ============================ ============== ========= =====================
3061
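The numeric constraints in the table above (power-of-2 values and the
``"MaxFlatWorkGroupSize"`` bounds) can be checked mechanically. The sketch below
is illustrative only: ``is_power_of_two`` and ``check_code_props`` are
hypothetical helpers, the sample values are made up, and "consistent with
ReqdWorkGroupSize" is assumed to mean that the product of the required X, Y and
Z sizes does not exceed ``"MaxFlatWorkGroupSize"``.

.. code-block:: python

  def is_power_of_two(n):
      """True for 1, 2, 4, 8, ...; False for 0, negatives and other values."""
      return n > 0 and (n & (n - 1)) == 0


  def check_code_props(props, reqd_work_group_size=(0, 0, 0)):
      """Return a list of problems found in a kernel code properties map."""
      problems = []
      for key in ("KernargSegmentAlign", "WavefrontSize"):
          if not is_power_of_two(props[key]):
              problems.append(key + " must be a power of 2")
      max_flat = props["MaxFlatWorkGroupSize"]
      if max_flat < 1:
          problems.append("MaxFlatWorkGroupSize must be >= 1")
      x, y, z = reqd_work_group_size
      # Assumption: "consistent" means the required work-group fits in the maximum.
      if (x, y, z) != (0, 0, 0) and x * y * z > max_flat:
          problems.append("MaxFlatWorkGroupSize is smaller than ReqdWorkGroupSize")
      return problems


  # Hypothetical values, for illustration only.
  props = {"KernargSegmentAlign": 8, "WavefrontSize": 64, "MaxFlatWorkGroupSize": 256}
  print(check_code_props(props, (16, 16, 1)))  # []

A required work-group size of 0, 0, 0 is treated as "not specified", matching
the default described for that key.
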
3062.. _amdgpu-amdhsa-code-object-metadata-v3:
3063
3064Code Object V3 Metadata
3065+++++++++++++++++++++++
3066
3067.. warning::
3068  Code object V3 is not the default code object version emitted by this version
3069  of LLVM.
3070
3071Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3072record (see :ref:`amdgpu-note-records-v3-onwards`).
3073
3074The metadata is represented as Message Pack formatted binary data (see
3075[MsgPack]_). The top level is a Message Pack map that includes the
3076keys defined in table
3077:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3078tables.
3079
3080Additional information can be added to the maps. To avoid conflicts,
3081any key names should be prefixed by "*vendor-name*." where
3082``vendor-name`` can be the name of the vendor and specific vendor
3083tool that generates the information. The prefix is abbreviated to
3084simply "." when it appears within a map that has been added by the
3085same *vendor-name*.
3086
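Read literally, the abbreviation rule means that a key beginning with "."
stands for the same key with the enclosing map's vendor prefix restored. The
sketch below illustrates that reading only; ``expand_key`` is a hypothetical
helper, and treating ``amdhsa`` as the *vendor-name* of the maps defined in the
following tables is an assumption made for the example.

.. code-block:: python

  def expand_key(key, vendor_name):
      """Restore the vendor prefix on an abbreviated key; leave other keys alone."""
      if key.startswith("."):
          return vendor_name + key
      return key


  print(expand_key(".name", "amdhsa"))           # amdhsa.name
  print(expand_key("amdhsa.kernels", "amdhsa"))  # amdhsa.kernels
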
3087  .. table:: AMDHSA Code Object V3 Metadata Map
3088     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3089
3090     ================= ============== ========= =======================================
3091     String Key        Value Type     Required? Description
3092     ================= ============== ========= =======================================
3093     "amdhsa.version"  sequence of    Required  - The first integer is the major
3094                       2 integers                 version. Currently 1.
3095                                                - The second integer is the minor
3096                                                  version. Currently 0.
3097     "amdhsa.printf"   sequence of              Each string is encoded information
3098                       strings                  about a printf function call. The
3099                                                encoded information is organized as
3100                                                fields separated by colon (':'):
3101
3102                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3103
3104                                                where:
3105
3106                                                ``ID``
3107                                                  A 32-bit integer as a unique id for
3108                                                  each printf function call
3109
3110                                                ``N``
3111                                                  A 32-bit integer equal to the number
3112                                                  of arguments of printf function call
3113                                                  minus 1
3114
3115                                                ``S[i]`` (where i = 0, 1, ... , N-1)
3116                                                  32-bit integers for the size in bytes
3117                                                  of the i-th FormatString argument of
3118                                                  the printf function call
3119
3120                                                ``FormatString``
3121                                                  The format string passed to the
3122                                                  printf function call.
3123     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
3124                       map                      kernel in the code object. See
3125                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3126                                                for the definition of the keys included
3127                                                in that map.
3128     ================= ============== ========= =======================================
3129
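The ``"amdhsa.printf"`` encoding described above can be decoded with two
bounded splits. This is a minimal sketch rather than the runtime's actual
parser: ``parse_printf_entry`` and the sample strings are hypothetical, and the
format string is assumed to be stored verbatim, so it may itself contain ':'
characters.

.. code-block:: python

  def parse_printf_entry(entry):
      """Split one "amdhsa.printf" string into (id, argument sizes, format string)."""
      id_field, n_field, rest = entry.split(":", 2)
      n = int(n_field)
      # Split off exactly N size fields; whatever remains is the format string.
      fields = rest.split(":", n)
      if len(fields) != n + 1:
          raise ValueError("expected %d size fields in %r" % (n, entry))
      sizes = [int(field) for field in fields[:n]]
      return int(id_field), sizes, fields[n]


  # Hypothetical entry: id 1, two arguments of 4 and 8 bytes.
  print(parse_printf_entry("1:2:4:8:a=%d: b=%s"))  # (1, [4, 8], 'a=%d: b=%s')

Bounding the second split by ``N`` keeps any ':' in the format string intact,
which an unbounded ``split(":")`` would not.
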
3130..
3131
3132  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3133     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3134
3135     =================================== ============== ========= ================================
3136     String Key                          Value Type     Required? Description
3137     =================================== ============== ========= ================================
3138     ".name"                             string         Required  Source name of the kernel.
3139     ".symbol"                           string         Required  Name of the kernel
3140                                                                  descriptor ELF symbol.
3141     ".language"                         string                   Source language of the kernel.
3142                                                                  Values include:
3143
3144                                                                  - "OpenCL C"
3145                                                                  - "OpenCL C++"
3146                                                                  - "HCC"
3147                                                                  - "HIP"
3148                                                                  - "OpenMP"
3149                                                                  - "Assembler"
3150
3151     ".language_version"                 sequence of              - The first integer is the major
3152                                         2 integers                 version.
3153                                                                  - The second integer is the
3154                                                                    minor version.
3155     ".args"                             sequence of              Sequence of maps of the
3156                                         map                      kernel arguments. See
3157                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3158                                                                  for the definition of the keys
3159                                                                  included in that map.
3160     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3161                                         3 integers               must be >=1 and the dispatch
3162                                                                  work-group size X, Y, Z must
3163                                                                  correspond to the specified
3164                                                                  values. Defaults to 0, 0, 0.
3165
3166                                                                  Corresponds to the OpenCL
3167                                                                  ``reqd_work_group_size``
3168                                                                  attribute.
3169     ".workgroup_size_hint"              sequence of              The dispatch work-group size
3170                                         3 integers               X, Y, Z is likely to be the
3171                                                                  specified values.
3172
3173                                                                  Corresponds to the OpenCL
3174                                                                  ``work_group_size_hint``
3175                                                                  attribute.
3176     ".vec_type_hint"                    string                   The name of a scalar or vector
3177                                                                  type.
3178
3179                                                                  Corresponds to the OpenCL
3180                                                                  ``vec_type_hint`` attribute.
3181
3182     ".device_enqueue_symbol"            string                   The external symbol name
3183                                                                  associated with a kernel.
3184                                                                  The OpenCL runtime allocates a
3185                                                                  global buffer for the symbol
3186                                                                  and saves the kernel's address
3187                                                                  to it, which is used for
3188                                                                  device side enqueueing. Only
3189                                                                  available for device side
3190                                                                  enqueued kernels.
3191     ".kernarg_segment_size"             integer        Required  The size in bytes of
3192                                                                  the kernarg segment
3193                                                                  that holds the values
3194                                                                  of the arguments to
3195                                                                  the kernel.
3196     ".group_segment_fixed_size"         integer        Required  The amount of group
3197                                                                  segment memory
3198                                                                  required by a
3199                                                                  work-group in
3200                                                                  bytes. This does not
3201                                                                  include any
3202                                                                  dynamically allocated
3203                                                                  group segment memory
3204                                                                  that may be added
3205                                                                  when the kernel is
3206                                                                  dispatched.
3207     ".private_segment_fixed_size"       integer        Required  The amount of fixed
3208                                                                  private address space
3209                                                                  memory required for a
3210                                                                  work-item in
3211                                                                  bytes. If the kernel
3212                                                                  uses a dynamic call
3213                                                                  stack then additional
3214                                                                  space must be added
3215                                                                  to this value for the
3216                                                                  call stack.
3217     ".kernarg_segment_align"            integer        Required  The maximum byte
3218                                                                  alignment of
3219                                                                  arguments in the
3220                                                                  kernarg segment. Must
3221                                                                  be a power of 2.
3222     ".wavefront_size"                   integer        Required  Wavefront size. Must
3223                                                                  be a power of 2.
3224     ".sgpr_count"                       integer        Required  Number of scalar
3225                                                                  registers required by a
3226                                                                  wavefront for
3227                                                                  GFX6-GFX9. A register
3228                                                                  is required if it is
3229                                                                  used explicitly, or
3230                                                                  if a higher numbered
3231                                                                  register is used
3232                                                                  explicitly. This
3233                                                                  includes the special
3234                                                                  SGPRs for VCC, Flat
3235                                                                  Scratch (GFX7-GFX9)
3236                                                                  and XNACK (for
3237                                                                  GFX8-GFX9). It does
3238                                                                  not include the 16
3239                                                                  SGPRs added if a trap
3240                                                                  handler is
3241                                                                  enabled. It is not
3242                                                                  rounded up to the
3243                                                                  allocation
3244                                                                  granularity.
3245     ".vgpr_count"                       integer        Required  Number of vector
3246                                                                  registers required by
3247                                                                  each work-item for
3248                                                                  GFX6-GFX9. A register
3249                                                                  is required if it is
3250                                                                  used explicitly, or
3251                                                                  if a higher numbered
3252                                                                  register is used
3253                                                                  explicitly.
3254     ".agpr_count"                       integer        Required  Number of accumulator
3255                                                                  registers required by
3256                                                                  each work-item for
3257                                                                  GFX90A, GFX908.
3258     ".max_flat_workgroup_size"          integer        Required  Maximum flat
3259                                                                  work-group size
3260                                                                  supported by the
3261                                                                  kernel in work-items.
3262                                                                  Must be >=1 and
3263                                                                  consistent with
3264                                                                  ".reqd_workgroup_size" if
3265                                                                  not 0, 0, 0.
3266     ".sgpr_spill_count"                 integer                  Number of stores from
3267                                                                  a scalar register to
3268                                                                  a register allocator
3269                                                                  created spill
3270                                                                  location.
3271     ".vgpr_spill_count"                 integer                  Number of stores from
3272                                                                  a vector register to
3273                                                                  a register allocator
3274                                                                  created spill
3275                                                                  location.
3276     ".kind"                             string                   The kind of the kernel
3277                                                                  with the following
3278                                                                  values:
3279
3280                                                                  "normal"
3281                                                                    Regular kernels.
3282
3283                                                                  "init"
3284                                                                    These kernels must be
3285                                                                    invoked after loading
3286                                                                    the containing code
3287                                                                    object and must
3288                                                                    complete before any
3289                                                                    normal and fini
3290                                                                    kernels in the same
3291                                                                    code object are
3292                                                                    invoked.
3293
3294                                                                  "fini"
3295                                                                    These kernels must be
3296                                                                    invoked before
3297                                                                    unloading the
3298                                                                    containing code object
3299                                                                    and after all init and
3300                                                                    normal kernels in the
3301                                                                    same code object have
3302                                                                    been invoked and
3303                                                                    completed.
3304
3305                                                                  If omitted, "normal" is
3306                                                                  assumed.
3307     =================================== ============== ========= ================================
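
As a rough illustration of the ordering that the ``.kind`` values imply, the
following Python sketch sorts kernel metadata maps so that "init" kernels come
first and "fini" kernels come last. The ``launch_order`` helper, the use of the
``.name`` key to identify a kernel, and the sample kernel names are assumptions
made for this example. A real runtime would also have to wait for each phase to
complete before starting the next.

.. code-block:: python

   # Phase rank implied by the ".kind" descriptions above.
   KIND_PHASE = {"init": 0, "normal": 1, "fini": 2}

   def launch_order(kernels):
       """Return kernel names ordered init -> normal -> fini.

       `kernels` is a list of kernel metadata maps (plain dicts here).
       A missing ".kind" is treated as "normal", as the table specifies.
       """
       def phase(kernel):
           return KIND_PHASE[kernel.get(".kind", "normal")]
       # sorted() is stable, so kernels keep their relative order within a phase.
       return [k[".name"] for k in sorted(kernels, key=phase)]

   # Hypothetical kernel metadata, for illustration only.
   example_kernels = [
       {".name": "teardown", ".kind": "fini"},
       {".name": "compute"},
       {".name": "setup", ".kind": "init"},
   ]
   print(launch_order(example_kernels))  # ['setup', 'compute', 'teardown']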
3308
3309..
3310
3311  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3312     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3313
3314     ====================== ============== ========= ================================
3315     String Key             Value Type     Required? Description
3316     ====================== ============== ========= ================================
3317     ".name"                string                   Kernel argument name.
3318     ".type_name"           string                   Kernel argument type name.
3319     ".size"                integer        Required  Kernel argument size in bytes.
3320     ".offset"              integer        Required  Kernel argument offset in
3321                                                     bytes. The offset must be a
3322                                                     multiple of the alignment
3323                                                     required by the argument.
3324     ".value_kind"          string         Required  Kernel argument kind that
3325                                                     specifies how to set up the
3326                                                     corresponding argument.
3327                                                     Values include:
3328
3329                                                     "by_value"
3330                                                       The argument is copied
3331                                                       directly into the kernarg.
3332
3333                                                     "global_buffer"
3334                                                       A global address space pointer
3335                                                       to the buffer data is passed
3336                                                       in the kernarg.
3337
3338                                                     "dynamic_shared_pointer"
3339                                                       A group address space pointer
3340                                                       to dynamically allocated LDS
3341                                                       is passed in the kernarg.
3342
3343                                                     "sampler"
3344                                                       A global address space
3345                                                       pointer to a S# is passed in
3346                                                       the kernarg.
3347
3348                                                     "image"
3349                                                       A global address space
3350                                                       pointer to a T# is passed in
3351                                                       the kernarg.
3352
3353                                                     "pipe"
3354                                                       A global address space pointer
3355                                                       to an OpenCL pipe is passed in
3356                                                       the kernarg.
3357
3358                                                     "queue"
3359                                                       A global address space pointer
3360                                                       to an OpenCL device enqueue
3361                                                       queue is passed in the
3362                                                       kernarg.
3363
3364                                                     "hidden_global_offset_x"
3365                                                       The OpenCL grid dispatch
3366                                                       global offset for the X
3367                                                       dimension is passed in the
3368                                                       kernarg.
3369
3370                                                     "hidden_global_offset_y"
3371                                                       The OpenCL grid dispatch
3372                                                       global offset for the Y
3373                                                       dimension is passed in the
3374                                                       kernarg.
3375
3376                                                     "hidden_global_offset_z"
3377                                                       The OpenCL grid dispatch
3378                                                       global offset for the Z
3379                                                       dimension is passed in the
3380                                                       kernarg.
3381
3382                                                     "hidden_none"
3383                                                       An argument that is not used
3384                                                       by the kernel. Space needs to
3385                                                       be left for it, but it does
3386                                                       not need to be set up.
3387
3388                                                     "hidden_printf_buffer"
3389                                                       A global address space pointer
3390                                                       to the runtime printf buffer
3391                                                       is passed in kernarg. Mutually
3392                                                       exclusive with
3393                                                       "hidden_hostcall_buffer"
3394                                                       before Code Object V5.
3395
3396                                                     "hidden_hostcall_buffer"
3397                                                       A global address space pointer
3398                                                       to the runtime hostcall buffer
3399                                                       is passed in kernarg. Mutually
3400                                                       exclusive with
3401                                                       "hidden_printf_buffer"
3402                                                       before Code Object V5.
3403
3404                                                     "hidden_default_queue"
3405                                                       A global address space pointer
3406                                                       to the OpenCL device enqueue
3407                                                       queue that should be used by
3408                                                       the kernel by default is
3409                                                       passed in the kernarg.
3410
3411                                                     "hidden_completion_action"
3412                                                       A global address space pointer
3413                                                       to help link enqueued kernels into
3414                                                       the ancestor tree for determining
3415                                                       when the parent kernel has finished.
3416
3417                                                     "hidden_multigrid_sync_arg"
3418                                                       A global address space pointer for
3419                                                       multi-grid synchronization is
3420                                                       passed in the kernarg.
3421
     ".value_type"          string                   Unused and deprecated. This
                                                     should no longer be emitted,
                                                     but is accepted for
                                                     compatibility.
3424
3425     ".pointee_align"       integer                  Alignment in bytes of pointee
3426                                                     type for pointer type kernel
3427                                                     argument. Must be a power
3428                                                     of 2. Only present if
3429                                                     ".value_kind" is
3430                                                     "dynamic_shared_pointer".
3431     ".address_space"       string                   Kernel argument address space
3432                                                     qualifier. Only present if
3433                                                     ".value_kind" is "global_buffer" or
3434                                                     "dynamic_shared_pointer". Values
3435                                                     are:
3436
3437                                                     - "private"
3438                                                     - "global"
3439                                                     - "constant"
3440                                                     - "local"
3441                                                     - "generic"
3442                                                     - "region"
3443
3444                                                     .. TODO::
3445
3446                                                        Is "global_buffer" only "global"
3447                                                        or "constant"? Is
3448                                                        "dynamic_shared_pointer" always
3449                                                        "local"? Can HCC allow "generic"?
3450                                                        How can "private" or "region"
3451                                                        ever happen?
3452
3453     ".access"              string                   Kernel argument access
3454                                                     qualifier. Only present if
3455                                                     ".value_kind" is "image" or
3456                                                     "pipe". Values
3457                                                     are:
3458
3459                                                     - "read_only"
3460                                                     - "write_only"
3461                                                     - "read_write"
3462
3463                                                     .. TODO::
3464
3465                                                        Does this apply to
3466                                                        "global_buffer"?
3467
3468     ".actual_access"       string                   The actual memory accesses
3469                                                     performed by the kernel on the
3470                                                     kernel argument. Only present if
3471                                                     ".value_kind" is "global_buffer",
3472                                                     "image", or "pipe". This may be
3473                                                     more restrictive than indicated
3474                                                     by ".access" to reflect what the
                                                     kernel actually does. If not
                                                     present then the runtime must
                                                     assume what is implied by
                                                     ".access" and ".is_const".
                                                     Values are:
3480
3481                                                     - "read_only"
3482                                                     - "write_only"
3483                                                     - "read_write"
3484
3485     ".is_const"            boolean                  Indicates if the kernel argument
3486                                                     is const qualified. Only present
3487                                                     if ".value_kind" is
3488                                                     "global_buffer".
3489
3490     ".is_restrict"         boolean                  Indicates if the kernel argument
3491                                                     is restrict qualified. Only
3492                                                     present if ".value_kind" is
3493                                                     "global_buffer".
3494
3495     ".is_volatile"         boolean                  Indicates if the kernel argument
3496                                                     is volatile qualified. Only
3497                                                     present if ".value_kind" is
3498                                                     "global_buffer".
3499
3500     ".is_pipe"             boolean                  Indicates if the kernel argument
3501                                                     is pipe qualified. Only present
3502                                                     if ".value_kind" is "pipe".
3503
3504                                                     .. TODO::
3505
3506                                                        Can "global_buffer" be pipe
3507                                                        qualified?
3508
3509     ====================== ============== ========= ================================
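
The required ``.size``, ``.offset`` and ``.value_kind`` keys are enough to
sanity-check a kernel argument list. The sketch below assumes the argument maps
have already been decoded into Python dicts. The ``check_args`` helper and the
sample arguments are hypothetical, and the overlap check is an extra sanity
check added for this example, not a rule stated in the table.

.. code-block:: python

   REQUIRED_KEYS = (".size", ".offset", ".value_kind")

   def check_args(args):
       """Validate kernel argument maps and return the kernarg bytes they span.

       Raises ValueError if a required key is missing or two arguments overlap.
       """
       end = 0
       for arg in sorted(args, key=lambda a: a.get(".offset", 0)):
           missing = [k for k in REQUIRED_KEYS if k not in arg]
           if missing:
               raise ValueError(f"argument is missing required keys: {missing}")
           if arg[".offset"] < end:
               raise ValueError(f"argument at offset {arg['.offset']} overlaps")
           end = arg[".offset"] + arg[".size"]
       return end

   # Hypothetical argument list: a pointer, a scalar, and one hidden argument.
   example_args = [
       {".name": "out", ".size": 8, ".offset": 0,
        ".value_kind": "global_buffer", ".address_space": "global"},
       {".name": "n", ".size": 4, ".offset": 8, ".value_kind": "by_value"},
       {".size": 8, ".offset": 16, ".value_kind": "hidden_global_offset_x"},
   ]
   print(check_args(example_args))  # 24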
3510
3511.. _amdgpu-amdhsa-code-object-metadata-v4:
3512
3513Code Object V4 Metadata
3514+++++++++++++++++++++++
3515
3516Code object V4 metadata is the same as
3517:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3518defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3519
3520  .. table:: AMDHSA Code Object V4 Metadata Map Changes
3521     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3522
3523     ================= ============== ========= =======================================
3524     String Key        Value Type     Required? Description
3525     ================= ============== ========= =======================================
3526     "amdhsa.version"  sequence of    Required  - The first integer is the major
3527                       2 integers                 version. Currently 1.
3528                                                - The second integer is the minor
3529                                                  version. Currently 1.
3530     "amdhsa.target"   string         Required  The target name of the code using the syntax:
3531
3532                                                .. code::
3533
3534                                                  <target-triple> [ "-" <target-id> ]
3535
3536                                                A canonical target ID must be
3537                                                used. See :ref:`amdgpu-target-triples`
3538                                                and :ref:`amdgpu-target-id`.
3539     ================= ============== ========= =======================================
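
The ``amdhsa.target`` syntax can be split mechanically. The following sketch
assumes the target triple is always written with all four ``-`` separated
fields (an empty environment leaves a trailing ``-``), and that the sample
string is a representative value. ``split_target`` is a hypothetical helper,
not part of LLVM.

.. code-block:: python

   def split_target(target):
       """Split an "amdhsa.target" string into (target_triple, target_id).

       The first three "-" characters are inside the four-field triple. A
       fourth "-", if present, separates the triple from the target ID, which
       may itself contain "-" (for example "xnack-").
       target_id is None when the optional part is absent.
       """
       fields = target.split("-", 4)
       if len(fields) < 4:
           raise ValueError(f"not a four-field target triple: {target!r}")
       triple = "-".join(fields[:4])
       target_id = fields[4] if len(fields) == 5 else None
       return triple, target_id

   # Illustrative value only.
   print(split_target("amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-"))
   # ('amdgcn-amd-amdhsa-', 'gfx908:sramecc+:xnack-')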
3540
3541.. _amdgpu-amdhsa-code-object-metadata-v5:
3542
3543Code Object V5 Metadata
3544+++++++++++++++++++++++
3545
3546.. warning::
3547  Code object V5 is not the default code object version emitted by this version
3548  of LLVM.
3549
3550
3551Code object V5 metadata is the same as
3552:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3553:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5` and table
3554:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3555
3556  .. table:: AMDHSA Code Object V5 Metadata Map Changes
3557     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3558
3559     ================= ============== ========= =======================================
3560     String Key        Value Type     Required? Description
3561     ================= ============== ========= =======================================
3562     "amdhsa.version"  sequence of    Required  - The first integer is the major
3563                       2 integers                 version. Currently 1.
3564                                                - The second integer is the minor
3565                                                  version. Currently 2.
3566     ================= ============== ========= =======================================
3567
3568..
3569
3570  .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3571     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3572
3573     ====================== ============== ========= ================================
3574     String Key             Value Type     Required? Description
3575     ====================== ============== ========= ================================
3576     ".value_kind"          string         Required  Kernel argument kind that
3577                                                     specifies how to set up the
3578                                                     corresponding argument.
                                                     Values are the same as for code
                                                     object V3 metadata (see
                                                     :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
                                                     with the following additions:
3583
3584                                                     "hidden_block_count_x"
3585                                                       The grid dispatch work-group count for the X dimension
3586                                                       is passed in the kernarg. Some languages, such as OpenCL,
3587                                                       support a last work-group in each dimension being partial.
3588                                                       This count only includes the non-partial work-group count.
3589                                                       This is not the same as the value in the AQL dispatch packet,
3590                                                       which has the grid size in work-items.
3591
3592                                                     "hidden_block_count_y"
3593                                                       The grid dispatch work-group count for the Y dimension
3594                                                       is passed in the kernarg. Some languages, such as OpenCL,
3595                                                       support a last work-group in each dimension being partial.
3596                                                       This count only includes the non-partial work-group count.
3597                                                       This is not the same as the value in the AQL dispatch packet,
3598                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1, then this must be 1.
3600
3601                                                     "hidden_block_count_z"
3602                                                       The grid dispatch work-group count for the Z dimension
3603                                                       is passed in the kernarg. Some languages, such as OpenCL,
3604                                                       support a last work-group in each dimension being partial.
3605                                                       This count only includes the non-partial work-group count.
3606                                                       This is not the same as the value in the AQL dispatch packet,
3607                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1 or 2, then this must be 1.
3609
3610                                                     "hidden_group_size_x"
3611                                                       The grid dispatch work-group size for the X dimension is
3612                                                       passed in the kernarg. This size only applies to the
3613                                                       non-partial work-groups. This is the same value as the AQL
3614                                                       dispatch packet work-group size.
3615
3616                                                     "hidden_group_size_y"
3617                                                       The grid dispatch work-group size for the Y dimension is
3618                                                       passed in the kernarg. This size only applies to the
3619                                                       non-partial work-groups. This is the same value as the AQL
3620                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1, then this must be 1.
3622
3623                                                     "hidden_group_size_z"
3624                                                       The grid dispatch work-group size for the Z dimension is
3625                                                       passed in the kernarg. This size only applies to the
3626                                                       non-partial work-groups. This is the same value as the AQL
3627                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1 or 2, then this must be 1.
3629
3630                                                     "hidden_remainder_x"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the X dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the X dimension.
3634
3635                                                     "hidden_remainder_y"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the Y dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the Y dimension.
3639
3640                                                     "hidden_remainder_z"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the Z dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the Z dimension.
3644
3645                                                     "hidden_grid_dims"
3646                                                       The grid dispatch dimensionality. This is the same value
3647                                                       as the AQL dispatch packet dimensionality. Must be a value
3648                                                       between 1 and 3.
3649
3650                                                     "hidden_heap_v1"
3651                                                       A global address space pointer to an initialized memory
3652                                                       buffer that conforms to the requirements of the malloc/free
3653                                                       device library V1 version implementation.
3654
3655                                                     "hidden_private_base"
3656                                                       The high 32 bits of the flat addressing private aperture base.
3657                                                       Only used by GFX8 to allow conversion between private segment
3658                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3659
3660                                                     "hidden_shared_base"
3661                                                       The high 32 bits of the flat addressing shared aperture base.
3662                                                       Only used by GFX8 to allow conversion between shared segment
3663                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3664
3665                                                     "hidden_queue_ptr"
3666                                                       A global memory address space pointer to the ROCm runtime
3667                                                       ``struct amd_queue_t`` structure for the HSA queue of the
3668                                                       associated dispatch AQL packet. It is only required for pre-GFX9
3669                                                       devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
3670
3671     ====================== ============== ========= ================================
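
The relationship between the block count, group size and remainder values can
be sketched with integer division. The example below assumes the grid size is
given in work-items per dimension, as in the AQL dispatch packet, and that
unused dimensions are passed as 1. ``hidden_dispatch_args`` is a hypothetical
helper name.

.. code-block:: python

   def hidden_dispatch_args(grid_size, group_size):
       """Derive per-dimension hidden argument values for one dispatch.

       grid_size:  grid size in work-items for X, Y, Z.
       group_size: work-group size in work-items for X, Y, Z.
       """
       values = {}
       for dim, grid, group in zip("xyz", grid_size, group_size):
           # Only complete (non-partial) work-groups are counted.
           values[f"hidden_block_count_{dim}"] = grid // group
           values[f"hidden_group_size_{dim}"] = group
           # Size of the trailing partial work-group, or 0 if there is none.
           values[f"hidden_remainder_{dim}"] = grid % group
       return values

   # A 2-dimensional dispatch: 1000 x 4 work-items in 256 x 2 work-groups.
   vals = hidden_dispatch_args(grid_size=(1000, 4, 1), group_size=(256, 2, 1))
   print(vals["hidden_block_count_x"], vals["hidden_remainder_x"])  # 3 232

With a grid size of 1 and a group size of 1 in the unused Z dimension, the
sketch yields a block count of 1, a group size of 1 and a remainder of 0, which
matches the constraints in the table.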
3672
3673..
3674
3675Kernel Dispatch
3676~~~~~~~~~~~~~~~
3677
3678The HSA architected queuing language (AQL) defines a user space memory interface
3679that can be used to control the dispatch of kernels, in an agent independent
3680way. An agent can have zero or more AQL queues created for it using an HSA
3681compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3682are 64 bytes) can be placed. See the *HSA Platform System Architecture
3683Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3684
3685The packet processor of a kernel agent is responsible for detecting and
3686dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3687packet processor is implemented by the hardware command processor (CP),
3688asynchronous dispatch controller (ADC) and shader processor input controller
3689(SPI).
3690
3691An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3692the kernel mode driver to initialize and register the AQL queue with CP.
3693
3694To dispatch a kernel the following actions are performed. This can occur in the
3695CPU host program, or from an HSA kernel executing on a GPU.
3696
36971. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3698   executed is obtained.
36992. A pointer to the kernel descriptor (see
3700   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3701   It must be for a kernel that is contained in a code object that was loaded
3702   by an HSA compatible runtime on the kernel agent with which the AQL queue is
3703   associated.
37043. Space is allocated for the kernel arguments using the HSA compatible runtime
3705   allocator for a memory region with the kernarg property for the kernel agent
3706   that will execute the kernel. It must be at least 16-byte aligned.
37074. Kernel argument values are assigned to the kernel argument memory
3708   allocation. The layout is defined in the *HSA Programmer's Language
3709   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3710   kernel argument memory in the same way constant memory is accessed. (Note
3711   that the HSA specification allows an implementation to copy the kernel
3712   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
37216. A kernel dispatch packet includes information about the actual dispatch,
3722   such as grid and work-group size, together with information from the code
3723   object about the kernel, such as segment sizes. The HSA compatible runtime
3724   queries on the kernel symbol can be used to obtain the code object values
3725   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
37267. CP executes micro-code and is responsible for detecting and setting up the
3727   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
3729   code, the scalar general purpose registers (SGPR) and vector general purpose
3730   registers (VGPR) are set up as required by the machine code. The required
3731   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3732   register state is defined in
3733   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
37349. The prolog of the kernel machine code (see
3735   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing to execute the machine code that corresponds to the kernel.
373710. When the kernel dispatch has completed execution, CP signals the completion
3738    signal specified in the kernel dispatch packet if not 0.
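
The write ordering in step 5 can be modelled with a small single-threaded
Python sketch. It is not an implementation of the HSA runtime. The ``ToyQueue``
class, the placement of the packet kind in the first two bytes of the packet,
and the kind value ``2`` are placeholders chosen for illustration, and the
sketch does not check whether the queue is full. The real packet layout and
the required atomic operations are defined by the HSA specification.

.. code-block:: python

   PACKET_SIZE = 64  # All AQL packets are 64 bytes.

   class ToyQueue:
       """Single-threaded stand-in for an AQL queue ring buffer."""

       def __init__(self, num_packets):
           self.num_packets = num_packets
           self.ring = bytearray(num_packets * PACKET_SIZE)
           self.write_index = 0
           self.doorbell = None

       def reserve(self):
           # A real runtime uses a 64-bit atomic operation on the write index.
           index = self.write_index
           self.write_index += 1
           return index

       def publish(self, index, kind, body):
           if len(body) != PACKET_SIZE - 2:
               raise ValueError("body must fill the rest of the packet")
           base = (index % self.num_packets) * PACKET_SIZE
           # 1. Fill in everything except the packet kind.
           self.ring[base + 2 : base + PACKET_SIZE] = body
           # 2. Write the packet kind last (an atomic store release in a real
           #    runtime) so the agent never sees a half-written packet.
           self.ring[base : base + 2] = kind.to_bytes(2, "little")
           # 3. Ring the doorbell to signal that the queue was updated.
           self.doorbell = index

   queue = ToyQueue(num_packets=4)
   slot = queue.reserve()
   queue.publish(slot, kind=2, body=bytes(PACKET_SIZE - 2))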
3739
3740.. _amdgpu-amdhsa-memory-spaces:
3741
3742Memory Spaces
3743~~~~~~~~~~~~~
3744
3745The memory space properties are:
3746
3747  .. table:: AMDHSA Memory Spaces
3748     :name: amdgpu-amdhsa-memory-spaces-table
3749
3750     ================= =========== ======== ======= ==================
3751     Memory Space Name HSA Segment Hardware Address NULL Value
3752                       Name        Name     Size
3753     ================= =========== ======== ======= ==================
3754     Private           private     scratch  32      0x00000000
3755     Local             group       LDS      32      0xFFFFFFFF
3756     Global            global      global   64      0x0000000000000000
3757     Constant          constant    *same as 64      0x0000000000000000
3758                                   global*
3759     Generic           flat        flat     64      0x0000000000000000
3760     Region            N/A         GDS      32      *not implemented
3761                                                    for AMDHSA*
3762     ================= =========== ======== ======= ==================
3763
3764The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may be accessible only to the CPU, some only to the GPU, and some to
both.
3768
3769Using the constant memory space indicates that the data will not change during
3770the execution of the kernel. This allows scalar read instructions to be
3771used. The vector and scalar L1 caches are invalidated of volatile data before
3772each kernel dispatch execution to allow constant memory to change values between
3773kernel dispatches.
3774
3775The local memory space uses the hardware Local Data Store (LDS) which is
3776automatically allocated when the hardware creates work-groups of wavefronts, and
3777freed when all the wavefronts of a work-group have terminated. The data store
3778(DS) instructions can be used to access it.
3779
3780The private memory space uses the hardware scratch memory support. If the kernel
3781uses scratch, then the hardware allocates memory that is accessed using
3782wavefront lane dword (4 byte) interleaving. The mapping used from private
3783address to physical address is:
3784
3785  ``wavefront-scratch-base +
3786  (private-address * wavefront-size * 4) +
3787  (wavefront-lane-id * 4)``
3788
3789There are different ways that the wavefront scratch base address is determined
3790by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
3792the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3793instructions, or by flat instructions. If each lane of a wavefront accesses the
3794same private address, the interleaving results in adjacent dwords being accessed
3795and hence requires fewer cache lines to be fetched. Multi-dword access is not
3796supported except by flat and scratch instructions in GFX9-GFX11.
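
The mapping above can be sketched in Python to show why lanes that access the
same private address end up touching adjacent dwords. This is an illustrative
sketch only: the function name, the example scratch base, and the default
wavefront size of 64 are assumptions, and the formula is evaluated exactly as it
is written above.

.. code-block:: python

   def scratch_physical_address(wavefront_scratch_base, private_address,
                                wavefront_lane_id, wavefront_size=64):
       """Evaluate the private-to-physical mapping exactly as stated above."""
       return (wavefront_scratch_base
               + private_address * wavefront_size * 4
               + wavefront_lane_id * 4)

   # Hypothetical scratch base, chosen only for illustration.
   base = 0x1000

   # Lanes 0..3 of one wavefront all access private address 3.
   addrs = [scratch_physical_address(base, 3, lane) for lane in range(4)]

   # Consecutive lanes differ by one dword (4 bytes).
   strides = [b - a for a, b in zip(addrs, addrs[1:])]
   print([hex(a) for a in addrs], strides)

With these example values the four lanes map to consecutive dwords, so a single
cache line can serve many lanes of the same access.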
3797
3798The generic address space uses the hardware flat address support available in
GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures) that are outside the range of addressable global memory to
map from a flat address to a private or local address.
3802
FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
3806setup in the kernel prologue (see
3807:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3808hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3809:ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3810
To convert between a segment address and a flat address, the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
For GFX9-GFX11 the aperture base addresses are directly available as inline
constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In
64-bit address mode the aperture sizes are 2^32 bytes and the base is aligned to
2^32, which makes it easier to convert from flat to segment or segment to flat.
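
Because each aperture is 2^32 bytes and aligned to 2^32 in 64-bit address mode,
the conversion reduces to simple bit operations. The following Python sketch
illustrates the arithmetic only; the function names and the example aperture
base are hypothetical, and NULL values, which differ between memory spaces, are
not handled.

.. code-block:: python

   APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


   def in_aperture(flat_address, aperture_base):
       """True if flat_address lies in the aperture starting at aperture_base."""
       return (flat_address >> 32) == (aperture_base >> 32)


   def segment_to_flat(segment_address, aperture_base):
       """Combine a 32-bit segment address with a 2^32-aligned aperture base."""
       if aperture_base % APERTURE_SIZE != 0:
           raise ValueError("aperture base must be aligned to 2^32")
       if not 0 <= segment_address < APERTURE_SIZE:
           raise ValueError("segment address must fit in 32 bits")
       return aperture_base | segment_address


   def flat_to_segment(flat_address, aperture_base):
       """Recover the 32-bit segment address from a flat address."""
       if not in_aperture(flat_address, aperture_base):
           raise ValueError("flat address is outside the aperture")
       return flat_address & (APERTURE_SIZE - 1)


   # Hypothetical local aperture base, chosen only for illustration.
   local_base = 0x0001_0000_0000_0000
   flat = segment_to_flat(0x1234, local_base)
   round_trip = flat_to_segment(flat, local_base)

The alignment is what allows the base and the segment address to be combined
with a bitwise OR and separated again with a mask, without any addition or
subtraction.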
3819
3820Image and Samplers
3821~~~~~~~~~~~~~~~~~~
3822
Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3825object respectively. In order to support the HSA ``query_sampler`` operations
3826two extra dwords are used to store the HSA BRIG enumeration values for the
3827queries that are not trivially deducible from the S# representation.
3828
3829HSA Signals
3830~~~~~~~~~~~
3831
3832HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3833are 64-bit addresses of a structure allocated in memory accessible from both the
3834CPU and GPU. The structure is defined by the runtime and subject to change
3835between releases. For example, see [AMD-ROCm-github]_.
3836
3837.. _amdgpu-amdhsa-hsa-aql-queue:
3838
3839HSA AQL Queue
3840~~~~~~~~~~~~~
3841
3842The HSA AQL queue structure is defined by an HSA compatible runtime (see
3843:ref:`amdgpu-os`) and subject to change between releases. For example, see
3844[AMD-ROCm-github]_. For some processors it contains fields needed to implement
3845certain language features such as the flat address aperture bases. It also
3846contains fields used by CP such as managing the allocation of scratch memory.
3847
3848.. _amdgpu-amdhsa-kernel-descriptor:
3849
3850Kernel Descriptor
3851~~~~~~~~~~~~~~~~~
3852
3853A kernel descriptor consists of the information needed by CP to initiate the
3854execution of a kernel, including the entry point address of the machine code
3855that implements the kernel.
3856
3857Code Object V3 Kernel Descriptor
3858++++++++++++++++++++++++++++++++
3859
CP microcode requires the kernel descriptor to be allocated on 64-byte
3861alignment.
3862
3863The fields used by CP for code objects before V3 also match those specified in
3864:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3865
3866  .. table:: Code Object V3 Kernel Descriptor
3867     :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3868
3869     ======= ======= =============================== ============================
3870     Bits    Size    Field Name                      Description
3871     ======= ======= =============================== ============================
3872     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3873                                                     address space memory
3874                                                     required for a work-group
3875                                                     in bytes. This does not
3876                                                     include any dynamically
3877                                                     allocated local address
3878                                                     space memory that may be
3879                                                     added when the kernel is
3880                                                     dispatched.
3881     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3882                                                     private address space
3883                                                     memory required for a
3884                                                     work-item in bytes.
3885                                                     Additional space may need to
3886                                                     be added to this value if
3887                                                     the call stack has
3888                                                     non-inlined function calls.
3889     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3890                                                     memory pointed to by the
3891                                                     AQL dispatch packet. The
3892                                                     kernarg memory is used to
3893                                                     pass arguments to the
3894                                                     kernel.
3895
3896                                                     * If the kernarg pointer in
3897                                                       the dispatch packet is NULL
3898                                                       then there are no kernel
3899                                                       arguments.
3900                                                     * If the kernarg pointer in
3901                                                       the dispatch packet is
3902                                                       not NULL and this value
3903                                                       is 0 then the kernarg
3904                                                       memory size is
3905                                                       unspecified.
3906                                                     * If the kernarg pointer in
3907                                                       the dispatch packet is
3908                                                       not NULL and this value
3909                                                       is not 0 then the value
3910                                                       specifies the kernarg
3911                                                       memory size in bytes. It
3912                                                       is recommended to provide
3913                                                       a value as it may be used
3914                                                       by CP to optimize making
3915                                                       the kernarg memory
3916                                                       visible to the kernel
3917                                                       code.
3918
3919     127:96  4 bytes                                 Reserved, must be 0.
3920     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3921                                                     negative) from base
3922                                                     address of kernel
3923                                                     descriptor to kernel's
3924                                                     entry point instruction
3925                                                     which must be 256 byte
3926                                                     aligned.
     351:192 20                                      Reserved, must be 0.
3928             bytes
3929     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
3930                                                       Reserved, must be 0.
3931                                                     GFX90A, GFX940
3932                                                       Compute Shader (CS)
3933                                                       program settings used by
3934                                                       CP to set up
3935                                                       ``COMPUTE_PGM_RSRC3``
3936                                                       configuration
3937                                                       register. See
3938                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3939                                                     GFX10-GFX11
3940                                                       Compute Shader (CS)
3941                                                       program settings used by
3942                                                       CP to set up
3943                                                       ``COMPUTE_PGM_RSRC3``
3944                                                       configuration
3945                                                       register. See
3946                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
3947     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3948                                                     program settings used by
3949                                                     CP to set up
3950                                                     ``COMPUTE_PGM_RSRC1``
3951                                                     configuration
3952                                                     register. See
3953                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
3954     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
3955                                                     program settings used by
3956                                                     CP to set up
3957                                                     ``COMPUTE_PGM_RSRC2``
3958                                                     configuration
3959                                                     register. See
3960                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
3962                                                     SGPR user data registers
3963                                                     (see
3964                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3965
3966                                                     The total number of SGPR
3967                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the
                                                     value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3971                                                     Any requests beyond 16
3972                                                     will be ignored.
3973     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
3974                     _BUFFER                         column of
3975                                                     :ref:`amdgpu-processor-table`
3976                                                     specifies *Architected flat
3977                                                     scratch* then not supported
                                                     and must be 0.
3979     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
3980     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
3981     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
3982     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
3983     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
3984                                                     column of
3985                                                     :ref:`amdgpu-processor-table`
3986                                                     specifies *Architected flat
3987                                                     scratch* then not supported
                                                     and must be 0.
3989     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
3990                     _SIZE
3991     457:455 3 bits                                  Reserved, must be 0.
3992     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
3993                                                       Reserved, must be 0.
3994                                                     GFX10-GFX11
3995                                                       - If 0 execute in
3996                                                         wavefront size 64 mode.
3997                                                       - If 1 execute in
3998                                                         native wavefront size
3999                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
4001     464     1 bit   RESERVED_464                    Deprecated, must be 0.
4002     467:465 3 bits                                  Reserved, must be 0.
4003     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
4005     511:472 5 bytes                                 Reserved, must be 0.
4006     512     **Total size 64 bytes.**
4007     ======= ====================================================================
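
The layout in the table above can be sketched with Python's ``struct`` module.
This is a minimal illustration, not the loader's or the assembler's
implementation: the little-endian byte order, the helper name
``decode_kernel_descriptor``, and the example field values are assumptions made
for the sketch.

.. code-block:: python

   import struct

   # Assumed little-endian packing of the 64-byte descriptor:
   #   3I   GROUP_SEGMENT_FIXED_SIZE, PRIVATE_SEGMENT_FIXED_SIZE, KERNARG_SIZE
   #   4x   reserved (4 bytes)
   #   q    KERNEL_CODE_ENTRY_BYTE_OFFSET (signed, may be negative)
   #   20x  reserved (20 bytes)
   #   3I   COMPUTE_PGM_RSRC3, COMPUTE_PGM_RSRC1, COMPUTE_PGM_RSRC2
   #   H    16 bits starting at descriptor bit 448 (SGPR enables and others)
   #   6x   remaining reserved or deprecated bits up to bit 511
   KERNEL_DESCRIPTOR = struct.Struct("<3I4xq20x3IH6x")

   # Bit positions relative to descriptor bit 448.
   PROPERTY_BITS = {
       "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,
       "ENABLE_SGPR_DISPATCH_PTR": 1,
       "ENABLE_SGPR_QUEUE_PTR": 2,
       "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
       "ENABLE_SGPR_DISPATCH_ID": 4,
       "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
       "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,
       "ENABLE_WAVEFRONT_SIZE32": 10,
   }


   def decode_kernel_descriptor(raw):
       """Unpack a 64-byte kernel descriptor into a dictionary of fields."""
       (group_size, private_size, kernarg_size, entry_offset,
        rsrc3, rsrc1, rsrc2, properties) = KERNEL_DESCRIPTOR.unpack(raw)
       return {
           "GROUP_SEGMENT_FIXED_SIZE": group_size,
           "PRIVATE_SEGMENT_FIXED_SIZE": private_size,
           "KERNARG_SIZE": kernarg_size,
           "KERNEL_CODE_ENTRY_BYTE_OFFSET": entry_offset,
           "COMPUTE_PGM_RSRC3": rsrc3,
           "COMPUTE_PGM_RSRC1": rsrc1,
           "COMPUTE_PGM_RSRC2": rsrc2,
           "enabled": sorted(name for name, bit in PROPERTY_BITS.items()
                             if (properties >> bit) & 1),
       }


   # Example values are made up; 0b1010 sets bits 449 and 451.
   raw = KERNEL_DESCRIPTOR.pack(1024, 16, 24, 256, 0, 0, 0, 0b1010)
   kd = decode_kernel_descriptor(raw)

The format string pads every reserved range explicitly, so the packed size can
be checked against the 64-byte total stated in the table.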
4008
4009..
4010
4011  .. table:: compute_pgm_rsrc1 for GFX6-GFX11
4012     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table
4013
4014     ======= ======= =============================== ===========================================================================
4015     Bits    Size    Field Name                      Description
4016     ======= ======= =============================== ===========================================================================
4017     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
4018                                                     blocks used by each work-item;
4019                                                     granularity is device
4020                                                     specific:
4021
4022                                                     GFX6-GFX9
4023                                                       - vgprs_used 0..256
4024                                                       - max(0, ceil(vgprs_used / 4) - 1)
4025                                                     GFX90A, GFX940
4026                                                       - vgprs_used 0..512
4027                                                       - vgprs_used = align(arch_vgprs, 4)
4028                                                                      + acc_vgprs
4029                                                       - max(0, ceil(vgprs_used / 8) - 1)
4030                                                     GFX10-GFX11 (wavefront size 64)
4031                                                       - max_vgpr 1..256
4032                                                       - max(0, ceil(vgprs_used / 4) - 1)
4033                                                     GFX10-GFX11 (wavefront size 32)
4034                                                       - max_vgpr 1..256
4035                                                       - max(0, ceil(vgprs_used / 8) - 1)
4036
4037                                                     Where vgprs_used is defined
4038                                                     as the highest VGPR number
4039                                                     explicitly referenced plus
4040                                                     one.
4041
4042                                                     Used by CP to set up
4043                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
4044
4045                                                     The
4046                                                     :ref:`amdgpu-assembler`
4047                                                     calculates this
4048                                                     automatically for the
4049                                                     selected processor from
4050                                                     values provided to the
4051                                                     `.amdhsa_kernel` directive
4052                                                     by the
4053                                                     `.amdhsa_next_free_vgpr`
4054                                                     nested directive (see
4055                                                     :ref:`amdhsa-kernel-directives-table`).
4056     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4057                                                     blocks used by a wavefront;
4058                                                     granularity is device
4059                                                     specific:
4060
4061                                                     GFX6-GFX8
4062                                                       - sgprs_used 0..112
4063                                                       - max(0, ceil(sgprs_used / 8) - 1)
4064                                                     GFX9
4065                                                       - sgprs_used 0..112
4066                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
4067                                                     GFX10-GFX11
4068                                                       Reserved, must be 0.
4069                                                       (128 SGPRs always
4070                                                       allocated.)
4071
4072                                                     Where sgprs_used is
4073                                                     defined as the highest
4074                                                     SGPR number explicitly
4075                                                     referenced plus one, plus
4076                                                     a target specific number
4077                                                     of additional special
4078                                                     SGPRs for VCC,
4079                                                     FLAT_SCRATCH (GFX7+) and
4080                                                     XNACK_MASK (GFX8+), and
4081                                                     any additional
4082                                                     target specific
4083                                                     limitations. It does not
4084                                                     include the 16 SGPRs added
4085                                                     if a trap handler is
4086                                                     enabled.
4087
4088                                                     The target specific
4089                                                     limitations and special
4090                                                     SGPR layout are defined in
4091                                                     the hardware
4092                                                     documentation, which can
4093                                                     be found in the
4094                                                     :ref:`amdgpu-processors`
4095                                                     table.
4096
4097                                                     Used by CP to set up
4098                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
4099
4100                                                     The
4101                                                     :ref:`amdgpu-assembler`
4102                                                     calculates this
4103                                                     automatically for the
4104                                                     selected processor from
4105                                                     values provided to the
4106                                                     `.amdhsa_kernel` directive
4107                                                     by the
4108                                                     `.amdhsa_next_free_sgpr`
4109                                                     and `.amdhsa_reserve_*`
4110                                                     nested directives (see
4111                                                     :ref:`amdhsa-kernel-directives-table`).
4112     11:10   2 bits  PRIORITY                        Must be 0.
4113
4114                                                     Start executing wavefront
4115                                                     at the specified priority.
4116
4117                                                     CP is responsible for
4118                                                     filling in
4119                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single precision
                                                     (32-bit) floating point
                                                     operations.
4126
4127                                                     Floating point rounding
4128                                                     mode values are defined in
4129                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4130
4131                                                     Used by CP to set up
4132                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double
                                                     precision (16 and 64-bit)
                                                     floating point operations.
4139
4140                                                     Floating point rounding
4141                                                     mode values are defined in
4142                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4143
4144                                                     Used by CP to set up
4145                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single precision
                                                     (32-bit) floating point
                                                     operations.
4152
4153                                                     Floating point denorm mode
4154                                                     values are defined in
4155                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4156
4157                                                     Used by CP to set up
4158                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double precision
                                                     (16 and 64-bit) floating
                                                     point operations.
4165
4166                                                     Floating point denorm mode
4167                                                     values are defined in
4168                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4169
4170                                                     Used by CP to set up
4171                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4172     20      1 bit   PRIV                            Must be 0.
4173
                                                     Start executing wavefront
                                                     in privileged trap handler
                                                     mode.
4177
4178                                                     CP is responsible for
4179                                                     filling in
4180                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
4181     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
4182                                                     with DX10 clamp mode
4183                                                     enabled. Used by the vector
4184                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
4186                                                     set, clamp NaN to zero,
4187                                                     otherwise pass NaN
4188                                                     through).
4189
4190                                                     Used by CP to set up
4191                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4192     22      1 bit   DEBUG_MODE                      Must be 0.
4193
4194                                                     Start executing wavefront
4195                                                     in single step mode.
4196
4197                                                     CP is responsible for
4198                                                     filling in
4199                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4200     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
4201                                                     with IEEE mode
4202                                                     enabled. Floating point
4203                                                     opcodes that support
4204                                                     exception flag gathering
4205                                                     will quiet and propagate
4206                                                     signaling-NaN inputs per
4207                                                     IEEE 754-2008. Min_dx10 and
4208                                                     max_dx10 become IEEE
4209                                                     754-2008 compliant due to
4210                                                     signaling-NaN propagation
4211                                                     and quieting.
4212
4213                                                     Used by CP to set up
4214                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4215     24      1 bit   BULKY                           Must be 0.
4216
4217                                                     Only one work-group allowed
4218                                                     to execute on a compute
4219                                                     unit.
4220
4221                                                     CP is responsible for
4222                                                     filling in
4223                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
4224     25      1 bit   CDBG_USER                       Must be 0.
4225
4226                                                     Flag that can be used to
4227                                                     control debugging code.
4228
4229                                                     CP is responsible for
4230                                                     filling in
4231                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4232     26      1 bit   FP16_OVFL                       GFX6-GFX8
4233                                                       Reserved, must be 0.
4234                                                     GFX9-GFX11
4235                                                       Wavefront starts execution
4236                                                       with specified fp16 overflow
4237                                                       mode.
4238
4239                                                       - If 0, fp16 overflow generates
4240                                                         +/-INF values.
4241                                                       - If 1, fp16 overflow that is the
4242                                                         result of a +/-INF input value
4243                                                         or divide by 0 produces a +/-INF,
4244                                                         otherwise clamps computed
4245                                                         overflow to +/-MAX_FP16 as
4246                                                         appropriate.
4247
4248                                                       Used by CP to set up
4249                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4250     28:27   2 bits                                  Reserved, must be 0.
4251     29      1 bit   WGP_MODE                        GFX6-GFX9
4252                                                       Reserved, must be 0.
4253                                                     GFX10-GFX11
4254                                                       - If 0 execute work-groups in
4255                                                         CU wavefront execution mode.
4256                                                       - If 1 execute work-groups in
4257                                                         WGP wavefront execution mode.
4258
4259                                                       See :ref:`amdgpu-amdhsa-memory-model`.
4260
4261                                                       Used by CP to set up
4262                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4263     30      1 bit   MEM_ORDERED                     GFX6-GFX9
4264                                                       Reserved, must be 0.
4265                                                     GFX10-GFX11
4266                                                       Controls the behavior of the
4267                                                       s_waitcnt's vmcnt and vscnt
4268                                                       counters.
4269
4270                                                       - If 0 vmcnt reports completion
4271                                                         of load and atomic with return
4272                                                         out of order with sample
4273                                                         instructions, and the vscnt
4274                                                         reports the completion of
4275                                                         store and atomic without
4276                                                         return in order.
4277                                                       - If 1 vmcnt reports completion
4278                                                         of load, atomic with return
4279                                                         and sample instructions in
4280                                                         order, and the vscnt reports
4281                                                         the completion of store and
4282                                                         atomic without return in order.
4283
4284                                                       Used by CP to set up
4285                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4286     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
4287                                                       Reserved, must be 0.
4288                                                     GFX10-GFX11
4289                                                       - If 0 execute SIMD wavefronts
4290                                                         using oldest first policy.
4291                                                       - If 1 execute SIMD wavefronts to
4292                                                         ensure wavefronts will make some
4293                                                         forward progress.
4294
4295                                                       Used by CP to set up
4296                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4297     32      **Total size 4 bytes.**
4298     ======= ===================================================================================================================
4299
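The single-bit fields in the upper part of ``compute_pgm_rsrc1`` can be read
back with ordinary shift-and-mask arithmetic. The following Python sketch is
illustrative only and is not code from LLVM or the runtime: the
``RSRC1_HIGH_FIELDS`` table and the ``decode_rsrc1_high`` helper are
hypothetical names, and only bits 23 to 31 of the table above are covered.

.. code-block:: python

   # Hypothetical helper for illustration only. Bit positions are copied from
   # the compute_pgm_rsrc1 table; only bits 23..31 are covered here.
   RSRC1_HIGH_FIELDS = {
       # name: (lowest bit, width in bits)
       "ENABLE_IEEE_MODE": (23, 1),
       "BULKY": (24, 1),
       "CDBG_USER": (25, 1),
       "FP16_OVFL": (26, 1),
       "RESERVED_28_27": (27, 2),
       "WGP_MODE": (29, 1),
       "MEM_ORDERED": (30, 1),
       "FWD_PROGRESS": (31, 1),
   }


   def decode_rsrc1_high(word):
       """Return the bit 23..31 fields of a 32-bit compute_pgm_rsrc1 value."""
       if not 0 <= word <= 0xFFFFFFFF:
           raise ValueError("compute_pgm_rsrc1 is a 32-bit value")
       return {
           name: (word >> shift) & ((1 << width) - 1)
           for name, (shift, width) in RSRC1_HIGH_FIELDS.items()
       }


   # Example: a word with ENABLE_IEEE_MODE, WGP_MODE and MEM_ORDERED set.
   example = decode_rsrc1_high((1 << 23) | (1 << 29) | (1 << 30))
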
4300..
4301
4302  .. table:: compute_pgm_rsrc2 for GFX6-GFX11
4303     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table
4304
4305     ======= ======= =============================== ===========================================================================
4306     Bits    Size    Field Name                      Description
4307     ======= ======= =============================== ===========================================================================
4308     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4309                                                       private segment.
4310                                                     * If the *Target Properties*
4311                                                       column of
4312                                                       :ref:`amdgpu-processor-table`
4313                                                       does not specify
4314                                                       *Architected flat
4315                                                       scratch* then enable the
4316                                                       setup of the SGPR
4317                                                       wavefront scratch offset
4318                                                       system register (see
4319                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4320                                                     * If the *Target Properties*
4321                                                       column of
4322                                                       :ref:`amdgpu-processor-table`
4323                                                       specifies *Architected
4324                                                       flat scratch* then enable
4325                                                       the setup of the
4326                                                       FLAT_SCRATCH register
4327                                                       pair (see
4328                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4329
4330                                                     Used by CP to set up
4331                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4332     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4333                                                     user data
4334                                                     registers requested. This
4335                                                     number must be greater than
4336                                                     or equal to the number of user
4337                                                     data registers enabled.
4338
4339                                                     Used by CP to set up
4340                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4341     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4342
4343                                                     This bit represents
4344                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4345                                                     which is set by the CP if
4346                                                     the runtime has installed a
4347                                                     trap handler.
4348     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4349                                                     system SGPR register for
4350                                                     the work-group id in the X
4351                                                     dimension (see
4352                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4353
4354                                                     Used by CP to set up
4355                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4356     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4357                                                     system SGPR register for
4358                                                     the work-group id in the Y
4359                                                     dimension (see
4360                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4361
4362                                                     Used by CP to set up
4363                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4364     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4365                                                     system SGPR register for
4366                                                     the work-group id in the Z
4367                                                     dimension (see
4368                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4369
4370                                                     Used by CP to set up
4371                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4372     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4373                                                     system SGPR register for
4374                                                     work-group information (see
4375                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4376
4377                                                     Used by CP to set up
4378                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4379     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4380                                                     VGPR system registers used
4381                                                     for the work-item ID.
4382                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4383                                                     defines the values.
4384
4385                                                     Used by CP to set up
4386                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4387     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4388
4389                                                     Wavefront starts execution
4390                                                     with address watch
4391                                                     exceptions enabled which
4392                                                     are generated when L1 has
4393                                                     witnessed a thread access
4394                                                     an *address of
4395                                                     interest*.
4396
4397                                                     CP is responsible for
4398                                                     filling in the address
4399                                                     watch bit in
4400                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4401                                                     according to what the
4402                                                     runtime requests.
4403     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4404
4405                                                     Wavefront starts execution
4406                                                     with memory violation
4407                                                     exceptions
4408                                                     enabled which are generated
4409                                                     when a memory violation has
4410                                                     occurred for this wavefront from
4411                                                     L1 or LDS
4412                                                     (write-to-read-only-memory,
4413                                                     mis-aligned atomic, LDS
4414                                                     address out of range,
4415                                                     illegal address, etc.).
4416
4417                                                     CP sets the memory
4418                                                     violation bit in
4419                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4420                                                     according to what the
4421                                                     runtime requests.
4422     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4423
4424                                                     CP uses the rounded value
4425                                                     from the dispatch packet,
4426                                                     not this value, as the
4427                                                     dispatch may contain
4428                                                     dynamically allocated group
4429                                                     segment memory. CP writes
4430                                                     directly to
4431                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4432
4433                                                     Amount of group segment
4434                                                     (LDS) to allocate for each
4435                                                     work-group. Granularity is
4436                                                     device specific:
4437
4438                                                     GFX6
4439                                                       roundup(lds-size / (64 * 4))
4440                                                     GFX7-GFX11
4441                                                       roundup(lds-size / (128 * 4))
4442
4443     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4444                     _INVALID_OPERATION              with specified exceptions
4445                                                     enabled.
4446
4447                                                     Used by CP to set up
4448                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
4449                                                     (set from bits 0..6).
4450
4451                                                     IEEE 754 FP Invalid
4452                                                     Operation
4453     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
4454                     _SOURCE                         input operands is a
4455                                                     denormal number
4456     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4457                     _DIVISION_BY_ZERO               Zero
4458     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4459                     _OVERFLOW
4460     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4461                     _UNDERFLOW
4462     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4463                     _INEXACT
4464     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4465                     _ZERO                           (rcp_iflag_f32 instruction
4466                                                     only)
4467     31      1 bit                                   Reserved, must be 0.
4468     32      **Total size 4 bytes.**
4469     ======= ===================================================================================================================
4470
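As a rough illustration of how the ``compute_pgm_rsrc2`` layout above fits
together, the Python sketch below packs named field values into a 32-bit word
and rejects non-zero values for the fields the table marks "Must be 0". It
also evaluates the ``roundup`` expression given for ``GRANULATED_LDS_SIZE``.
The names ``RSRC2_FIELDS``, ``pack_rsrc2`` and ``granulated_lds_size`` are
hypothetical, and treating ``lds-size`` as a byte count and ``roundup`` as a
ceiling division are assumptions.

.. code-block:: python

   # Hypothetical sketch: field layout copied from the compute_pgm_rsrc2 table.
   RSRC2_FIELDS = {
       # name: (lowest bit, width in bits)
       "ENABLE_PRIVATE_SEGMENT": (0, 1),
       "USER_SGPR_COUNT": (1, 5),
       "ENABLE_TRAP_HANDLER": (6, 1),
       "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
       "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
       "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
       "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
       "ENABLE_VGPR_WORKITEM_ID": (11, 2),
       "ENABLE_EXCEPTION_ADDRESS_WATCH": (13, 1),
       "ENABLE_EXCEPTION_MEMORY": (14, 1),
       "GRANULATED_LDS_SIZE": (15, 9),
       "ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION": (24, 1),
       "ENABLE_EXCEPTION_FP_DENORMAL_SOURCE": (25, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO": (26, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW": (27, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW": (28, 1),
       "ENABLE_EXCEPTION_IEEE_754_FP_INEXACT": (29, 1),
       "ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO": (30, 1),
   }

   # Fields the table marks "Must be 0" in the kernel descriptor.
   MUST_BE_ZERO = {
       "ENABLE_TRAP_HANDLER",
       "ENABLE_EXCEPTION_ADDRESS_WATCH",
       "ENABLE_EXCEPTION_MEMORY",
       "GRANULATED_LDS_SIZE",
   }


   def pack_rsrc2(**values):
       """Pack named field values into a 32-bit compute_pgm_rsrc2 word."""
       word = 0
       for name, value in values.items():
           shift, width = RSRC2_FIELDS[name]
           if not 0 <= value < (1 << width):
               raise ValueError(f"{name} does not fit in {width} bit(s)")
           if name in MUST_BE_ZERO and value != 0:
               raise ValueError(f"{name} must be 0 in the kernel descriptor")
           word |= value << shift
       return word


   def granulated_lds_size(lds_size_bytes, is_gfx6=False):
       """Evaluate roundup(lds-size / granule); byte units are an assumption."""
       granule = (64 if is_gfx6 else 128) * 4
       return -(-lds_size_bytes // granule)  # ceiling division
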
4471..
4472
4473  .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4474     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4475
4476     ======= ======= =============================== ===========================================================================
4477     Bits    Size    Field Name                      Description
4478     ======= ======= =============================== ===========================================================================
4479     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
4480                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4481                                                     63 - accum-offset = 256.
4482     15:6    10                                      Reserved, must be 0.
4483             bits
4484     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
4485                                                       launched in the same CU.
4486                                                     - If 1 the waves of a work-group can be
4487                                                       launched in different CUs. The waves
4488                                                       cannot use S_BARRIER or LDS.
4489     31:17   15                                      Reserved, must be 0.
4490             bits
4491     32      **Total size 4 bytes.**
4492     ======= ===================================================================================================================
4493
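The ``ACCUM_OFFSET`` encoding in the table above (0 means an offset of 4, 63
means an offset of 256) amounts to ``accum-offset / 4 - 1``. A minimal Python
sketch of that mapping follows; the function names are hypothetical and the
code is not taken from LLVM.

.. code-block:: python

   def encode_accum_offset(accum_offset):
       """Encode an AccVGPR offset (4, 8, ..., 256) as ACCUM_OFFSET (0..63)."""
       if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
           raise ValueError("accum-offset must be a multiple of 4 in 4..256")
       return accum_offset // 4 - 1


   def decode_accum_offset(field):
       """Recover the AccVGPR offset from an ACCUM_OFFSET field value."""
       if not 0 <= field <= 63:
           raise ValueError("ACCUM_OFFSET is a 6-bit field")
       return (field + 1) * 4


   def pack_rsrc3_gfx90a(accum_offset, tg_split):
       """Place ACCUM_OFFSET in bits 5:0 and TG_SPLIT in bit 16."""
       return encode_accum_offset(accum_offset) | (int(bool(tg_split)) << 16)
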
4494..
4495
4496  .. table:: compute_pgm_rsrc3 for GFX10-GFX11
4497     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
4498
4499     ======= ======= =============================== ===========================================================================
4500     Bits    Size    Field Name                      Description
4501     ======= ======= =============================== ===========================================================================
4502     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
4503                                                     wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4504                                                     of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4505                                                     not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4506     9:4     6 bits  INST_PREF_SIZE                  GFX10
4507                                                       Reserved, must be 0.
4508                                                     GFX11
4509                                                       Number of instruction bytes to prefetch, starting at the kernel's entry
4510                                                       point instruction, before wavefront starts execution. The value is 0..63
4511                                                       with a granularity of 128 bytes.
4512     10      1 bit   TRAP_ON_START                   GFX10
4513                                                       Reserved, must be 0.
4514                                                     GFX11
4515                                                       Must be 0.
4516
4517                                                       If 1, wavefront starts execution by trapping into the trap handler.
4518
4519                                                       CP is responsible for filling in the trap on start bit in
4520                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
4521                                                       requests.
4522     11      1 bit   TRAP_ON_END                     GFX10
4523                                                       Reserved, must be 0.
4524                                                     GFX11
4525                                                       Must be 0.
4526
4527                                                       If 1, wavefront execution terminates by trapping into the trap handler.
4528
4529                                                       CP is responsible for filling in the trap on end bit in
4530                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
4531     30:12   19 bits                                 Reserved, must be 0.
4532     31      1 bit   IMAGE_OP                        GFX10
4533                                                       Reserved, must be 0.
4534                                                     GFX11
4535                                                       If 1, the kernel execution contains image instructions. If executed as
4536                                                       part of a graphics pipeline, image read instructions will stall waiting
4537                                                       for any necessary ``WAIT_SYNC`` fence to be performed in order to
4538                                                       indicate that earlier pipeline stages have completed writing to the
4539                                                       image.
4540
4541                                                       Not used for compute kernels that are not part of a graphics pipeline and
4542                                                       must be 0.
4543     32      **Total size 4 bytes.**
4544     ======= ===================================================================================================================
4545
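The ``SHARED_VGPR_COUNT`` constraint and the ``INST_PREF_SIZE`` granularity
from the table above can be checked with a few lines of arithmetic. The
Python sketch below is illustrative only. Both function names are
hypothetical, ``rsrc1_vgprs`` stands for the ``compute_pgm_rsrc1.vgprs`` value
used in the table's formula, and clamping the prefetch size to 63 when the
code is larger than the field can express is an assumption.

.. code-block:: python

   def shared_vgpr_count_is_valid(rsrc1_vgprs, shared_vgpr_count, wavefront_size):
       """Check the SHARED_VGPR_COUNT rules stated in the table."""
       if not 0 <= shared_vgpr_count <= 15:
           return False  # 4-bit field
       if wavefront_size == 32:
           return shared_vgpr_count == 0
       # Wavefront size 64: total must not exceed 256 VGPRs.
       return (rsrc1_vgprs + 1) * 4 + shared_vgpr_count * 8 <= 256


   def inst_pref_size(code_size_bytes):
       """Number of 128-byte units covering the code, limited to 0..63.

       Clamping to 63 for larger code sizes is an assumption.
       """
       units = -(-code_size_bytes // 128)  # ceiling division
       return min(units, 63)
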
4546..
4547
4548  .. table:: Floating Point Rounding Mode Enumeration Values
4549     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4550
4551     ====================================== ===== ==============================
4552     Enumeration Name                       Value Description
4553     ====================================== ===== ==============================
4554     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4555     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4556     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4557     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4558     ====================================== ===== ==============================
4559
4560..
4561
4562  .. table:: Floating Point Denorm Mode Enumeration Values
4563     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4564
4565     ====================================== ===== ==============================
4566     Enumeration Name                       Value Description
4567     ====================================== ===== ==============================
4568     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
4569                                                  Denorms
4570     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4571     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4572     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4573     ====================================== ===== ==============================
4574
4575..
4576
4577  .. table:: System VGPR Work-Item ID Enumeration Values
4578     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4579
4580     ======================================== ===== ============================
4581     Enumeration Name                         Value Description
4582     ======================================== ===== ============================
4583     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4584                                                    ID.
4585     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4586                                                    dimensions ID.
4587     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4588                                                    dimensions ID.
4589     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4590     ======================================== ===== ============================
4591
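The work-item ID enumeration above is cumulative: each value also sets up
every lower dimension, so a kernel that needs the Z ID also receives X and Y.
The Python sketch below shows one way to pick the smallest value that covers
the IDs a kernel needs. The class and function names are hypothetical.

.. code-block:: python

   from enum import IntEnum


   class SystemVgprWorkitemId(IntEnum):
       """Values from the System VGPR Work-Item ID enumeration table."""
       X = 0
       X_Y = 1
       X_Y_Z = 2
       UNDEFINED = 3


   def workitem_id_enable(needs_y, needs_z):
       """Smallest enumeration value that sets up every needed work-item ID.

       The X dimension ID is set up by every defined value.
       """
       if needs_z:
           return SystemVgprWorkitemId.X_Y_Z
       if needs_y:
           return SystemVgprWorkitemId.X_Y
       return SystemVgprWorkitemId.X
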
4592.. _amdgpu-amdhsa-initial-kernel-execution-state:
4593
4594Initial Kernel Execution State
4595~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4596
4597This section defines the register state that will be set up by the packet
4598processor prior to the start of execution of every wavefront. This is limited by
4599the constraints of the hardware controllers of CP/ADC/SPI.
4600
4601The order of the SGPR registers is defined, but the compiler can specify which
4602ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4603fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4604for enabled registers are dense starting at SGPR0: the first enabled register is
4605SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4606an SGPR number.
4607
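The dense numbering described above can be sketched in Python as follows,
using the order and register counts from
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. This is an
illustration, not the compiler's implementation: ``SGPR_ORDER`` and
``assign_sgprs`` are hypothetical names, and the limit on the number of
initialized User SGPRs is not modeled.

.. code-block:: python

   # (enable field, number of SGPRs), in the order given by the
   # SGPR Register Set Up Order table.
   SGPR_ORDER = [
       ("enable_sgpr_private_segment_buffer", 4),
       ("enable_sgpr_dispatch_ptr", 2),
       ("enable_sgpr_queue_ptr", 2),
       ("enable_sgpr_kernarg_segment_ptr", 2),
       ("enable_sgpr_dispatch_id", 2),
       ("enable_sgpr_flat_scratch_init", 2),
       ("enable_sgpr_private_segment_size", 1),
       ("enable_sgpr_workgroup_id_X", 1),
       ("enable_sgpr_workgroup_id_Y", 1),
       ("enable_sgpr_workgroup_id_Z", 1),
       ("enable_sgpr_workgroup_info", 1),
       ("enable_sgpr_private_segment_wavefront_offset", 1),
   ]


   def assign_sgprs(enabled):
       """Map each enabled field to (first SGPR number, SGPR count)."""
       unknown = set(enabled) - {name for name, _ in SGPR_ORDER}
       if unknown:
           raise ValueError(f"unknown enable fields: {sorted(unknown)}")
       assignment = {}
       next_sgpr = 0
       for name, count in SGPR_ORDER:
           if name in enabled:
               # Enabled registers are numbered densely from SGPR0.
               assignment[name] = (next_sgpr, count)
               next_sgpr += count
           # Disabled registers get no SGPR number at all.
       return assignment
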
4608The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4609all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4610using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4611actually initialized. These are then immediately followed by the System SGPRs
4612that are set up by ADC/SPI and can have different values for each wavefront of
4613the grid dispatch.
4614
4615SGPR register initial state is defined in
4616:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4617
4618  .. table:: SGPR Register Set Up Order
4619     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4620
4621     ========== ========================== ====== ==============================
4622     SGPR Order Name                       Number Description
4623                (kernel descriptor enable  of
4624                field)                     SGPRs
4625     ========== ========================== ====== ==============================
4626     First      Private Segment Buffer     4      See
4627                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4628                _segment_buffer)
4629     then       Dispatch Ptr               2      64-bit address of AQL dispatch
4630                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4631                                                  actually executing.
4632     then       Queue Ptr                  2      64-bit address of amd_queue_t
4633                (enable_sgpr_queue_ptr)           object for AQL queue on which
4634                                                  the dispatch packet was
4635                                                  queued.
4636     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4637                (enable_sgpr_kernarg              segment. This is directly
4638                _segment_ptr)                     copied from the
4639                                                  kernarg_address in the kernel
4640                                                  dispatch packet.
4641
4642                                                  Having CP load it once avoids
4643                                                  loading it at the beginning of
4644                                                  every wavefront.
4645     then       Dispatch Id                2      64-bit Dispatch ID of the
4646                (enable_sgpr_dispatch_id)         dispatch packet being
4647                                                  executed.
4648     then       Flat Scratch Init          2      See
4649                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4650                _init)
4651     then       Private Segment Size       1      The 32-bit byte size of a
4652                (enable_sgpr_private              single work-item's memory
4653                _segment_size)                    allocation. This is the
4654                                                  value from the kernel
4655                                                  dispatch packet Private
4656                                                  Segment Byte Size rounded up
4657                                                  by CP to a multiple of
4658                                                  DWORD.
4659
4660                                                  Having CP load it once avoids
4661                                                  loading it at the beginning of
4662                                                  every wavefront.
4663
4664                                                  This is not used for
4665                                                  GFX7-GFX8 since it is the same
4666                                                  value as the second SGPR of
4667                                                  Flat Scratch Init. However, it
4668                                                  may be needed for GFX9-GFX11 which
4669                                                  change the meaning of the
4670                                                  Flat Scratch Init value.
4671     then       Work-Group Id X            1      32-bit work-group id in X
4672                (enable_sgpr_workgroup_id         dimension of grid for
4673                _X)                               wavefront.
4674     then       Work-Group Id Y            1      32-bit work-group id in Y
4675                (enable_sgpr_workgroup_id         dimension of grid for
4676                _Y)                               wavefront.
4677     then       Work-Group Id Z            1      32-bit work-group id in Z
4678                (enable_sgpr_workgroup_id         dimension of grid for
4679                _Z)                               wavefront.
4680     then       Work-Group Info            1      {first_wavefront, 14'b0000,
4681                (enable_sgpr_workgroup            ordered_append_term[10:0],
4682                _info)                            threadgroup_size_in_wavefronts[5:0]}
4683     then       Scratch Wavefront Offset   1      See
4684                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4685                _segment_wavefront_offset)        and
4686                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4687     ========== ========================== ====== ==============================
4688
The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at VGPR0: the first enabled
register is VGPR0, the next enabled register is VGPR1, and so on; disabled
registers do not have a VGPR number.
4695
4696There are different methods used for the VGPR initial state:
4697
4698* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4699  specifies otherwise, a separate VGPR register is used per work-item ID. The
4700  VGPR register initial state for this method is defined in
4701  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Packed work-item IDs*, the initial value of the VGPR0 register is
  used for all work-item IDs. The register layout for this method is defined in
  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4706
4707  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4708     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4709
4710     ========== ========================== ====== ==============================
4711     VGPR Order Name                       Number Description
4712                (kernel descriptor enable  of
4713                field)                     VGPRs
4714     ========== ========================== ====== ==============================
4715     First      Work-Item Id X             1      32-bit work-item id in X
4716                (Always initialized)              dimension of work-group for
4717                                                  wavefront lane.
4718     then       Work-Item Id Y             1      32-bit work-item id in Y
4719                (enable_vgpr_workitem_id          dimension of work-group for
4720                > 0)                              wavefront lane.
4721     then       Work-Item Id Z             1      32-bit work-item id in Z
4722                (enable_vgpr_workitem_id          dimension of work-group for
4723                > 1)                              wavefront lane.
4724     ========== ========================== ====== ==============================
4725
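The dense numbering of the unpacked method can be illustrated with a short
Python sketch. It is not code from the backend: the function name
``unpacked_workitem_id_vgprs`` is hypothetical, and its argument stands for the
value of the ``enable_vgpr_workitem_id`` kernel descriptor field.

```python
def unpacked_workitem_id_vgprs(enable_vgpr_workitem_id):
    """Map each initialized work-item ID to its VGPR number (unpacked method).

    Hypothetical helper that mirrors the conditions in the table above.
    """
    enabled = ["X"]                      # Work-Item Id X is always initialized.
    if enable_vgpr_workitem_id > 0:
        enabled.append("Y")              # Y is set up when the field is > 0.
    if enable_vgpr_workitem_id > 1:
        enabled.append("Z")              # Z is set up when the field is > 1.
    # Numbers are dense: the first enabled register is VGPR0, the next VGPR1...
    return {name: number for number, name in enumerate(enabled)}
```

With a field value of ``1``, for example, the sketch assigns X to VGPR0 and Y
to VGPR1, and Z has no VGPR number.
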
4726..
4727
4728  .. table:: Register Layout for Packed Work-Item ID Method
4729     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4730
4731     ======= ======= ================ =========================================
4732     Bits    Size    Field Name       Description
4733     ======= ======= ================ =========================================
4734     0:9     10 bits Work-Item Id X   Work-item id in X
4735                                      dimension of work-group for
4736                                      wavefront lane.
4737
4738                                      Always initialized.
4739
4740     10:19   10 bits Work-Item Id Y   Work-item id in Y
4741                                      dimension of work-group for
4742                                      wavefront lane.
4743
4744                                      Initialized if enable_vgpr_workitem_id >
4745                                      0, otherwise set to 0.
4746     20:29   10 bits Work-Item Id Z   Work-item id in Z
4747                                      dimension of work-group for
4748                                      wavefront lane.
4749
4750                                      Initialized if enable_vgpr_workitem_id >
4751                                      1, otherwise set to 0.
4752     30:31   2 bits                   Reserved, set to 0.
4753     ======= ======= ================ =========================================
4754
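As a minimal sketch of the packed layout above, the following Python helpers
extract the three 10-bit fields from the initial VGPR0 value and build such a
value for testing. The helper names are hypothetical and only illustrate the
bit positions given in the table.

```python
FIELD_BITS = 10
FIELD_MASK = (1 << FIELD_BITS) - 1   # 0x3FF, one 10-bit work-item ID field


def unpack_workitem_ids(vgpr0):
    """Split a packed VGPR0 value into (x, y, z) work-item IDs."""
    x = vgpr0 & FIELD_MASK                        # bits 0:9
    y = (vgpr0 >> FIELD_BITS) & FIELD_MASK        # bits 10:19
    z = (vgpr0 >> (2 * FIELD_BITS)) & FIELD_MASK  # bits 20:29
    return x, y, z


def pack_workitem_ids(x, y, z):
    """Build a packed value; bits 30:31 are reserved and left as 0."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("each work-item ID must fit in 10 bits")
    return x | (y << FIELD_BITS) | (z << (2 * FIELD_BITS))
```

A field that is not enabled reads back as ``0``, matching the "otherwise set
to 0" wording in the table.
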
4755The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4756
47571. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4758   registers.
47592. Work-group Id registers X, Y, Z are set by ADC which supports any
4760   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
   its value cannot be included with the flat scratch init value, which is per
   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
47644. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4765   or (X, Y, Z).
47665. Flat Scratch register pair initialization is described in
4767   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4768
4769The global segment can be accessed either using buffer instructions (GFX6 which
4770has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
4771instructions (GFX9-GFX11).
4772
4773If buffer operations are used, then the compiler can generate a V# with the
4774following properties:
4775
4776* base address of 0
4777* no swizzle
4778* ATC: 1 if IOMMU present (such as APU)
4779* ptr64: 1
4780* MTYPE set to support memory coherence that matches the runtime (such as CC for
4781  APU and NC for dGPU).
4782
4783.. _amdgpu-amdhsa-kernel-prolog:
4784
4785Kernel Prolog
4786~~~~~~~~~~~~~
4787
4788The compiler performs initialization in the kernel prologue depending on the
4789target and information about things like stack usage in the kernel and called
4790functions. Some of this initialization requires the compiler to request certain
4791User and System SGPRs be present in the
4792:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4793:ref:`amdgpu-amdhsa-kernel-descriptor`.
4794
4795.. _amdgpu-amdhsa-kernel-prolog-cfi:
4796
4797CFI
4798+++
4799
48001.  The CFI return address is undefined.
4801
48022.  The CFI CFA is defined using an expression which evaluates to a location
4803    description that comprises one memory location description for the
4804    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4805
4806.. _amdgpu-amdhsa-kernel-prolog-m0:
4807
4808M0
4809++
4810
4811GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible value of LDS for the given target (0x7FFF for GFX6 and
  0xFFFF for GFX7-GFX8).
4817GFX9-GFX11
4818  The M0 register is not used for range checking LDS accesses and so does not
4819  need to be initialized in the prolog.
4820
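The M0 rule can be summarized with a small Python sketch. The function name and
the use of a bare major generation number are assumptions made for
illustration; the constants are the ones quoted above.

```python
def m0_prolog_value(gfx_major, total_lds_size=None):
    """Return a value for M0 in the kernel prolog, or None if none is needed.

    Hypothetical helper: ``gfx_major`` is the GFX generation (6 to 11) and
    ``total_lds_size`` is the total LDS size taken from the dispatch packet.
    """
    if gfx_major >= 9:
        # GFX9-GFX11: M0 is not used for range checking LDS accesses.
        return None
    if gfx_major < 6:
        raise ValueError("only GFX6 and later are described here")
    if total_lds_size is not None:
        # Any value at least the total LDS size is acceptable.
        return total_lds_size
    # Otherwise fall back to the maximum possible value for the target.
    return 0x7FFF if gfx_major == 6 else 0xFFFF
```

The sketch only applies when the kernel may access LDS via DS or flat
operations; a kernel that never touches LDS has no need to initialize M0.
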
4821.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4822
4823Stack Pointer
4824+++++++++++++
4825
4826If the kernel has function calls it must set up the ABI stack pointer described
4827in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4828SGPR32 to the unswizzled scratch offset of the address past the last local
4829allocation.
4830
4831.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4832
4833Frame Pointer
4834+++++++++++++
4835
4836If the kernel needs a frame pointer for the reasons defined in
4837``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4838kernel prolog. If a frame pointer is not required then all uses of the frame
4839pointer are replaced with immediate ``0`` offsets.
4840
4841.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4842
4843Flat Scratch
4844++++++++++++
4845
4846There are different methods used for initializing flat scratch:
4847
4848* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4849  specifies *Does not support generic address space*:
4850
4851  Flat scratch is not supported and there is no flat scratch register pair.
4852
4853* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4854  specifies *Offset flat scratch*:
4855
4856  If the kernel or any function it calls may use flat operations to access
4857  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4858  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4859  Scratch Wavefront Offset SGPR registers (see
4860  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4861
4862  1. The low word of Flat Scratch Init is the 32-bit byte offset from
4863     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4864     being managed by SPI for the queue executing the kernel dispatch. This is
4865     the same value used in the Scratch Segment Buffer V# base address.
4866
4867     CP obtains this from the runtime. (The Scratch Segment Buffer base address
4868     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4869
4870     The prolog must add the value of Scratch Wavefront Offset to get the
4871     wavefront's byte scratch backing memory offset from
4872     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4873
4874     The Scratch Wavefront Offset must also be used as an offset with Private
4875     segment address when using the Scratch Segment Buffer.
4876
     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.
4879
4880     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4881     SGPRn is the highest numbered SGPR allocated to the wavefront).
4882     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4883     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4884     FLAT SCRATCH BASE in flat memory instructions that access the scratch
4885     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
4888
4889     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4890     checks that the value in the kernel dispatch packet Private Segment Byte
4891     Size is not larger and requests the runtime to increase the queue's scratch
4892     size if necessary.
4893
4894     CP directly loads from the kernel dispatch packet Private Segment Byte Size
4895     field and rounds up to a multiple of DWORD. Having CP load it once avoids
4896     loading it at the beginning of every wavefront.
4897
4898     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4899     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4900     in flat memory instructions.
4901
4902* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4903  specifies *Absolute flat scratch*:
4904
4905  If the kernel or any function it calls may use flat operations to access
4906  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4907  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4908  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4909  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4910
4911  The Flat Scratch Init is the 64-bit address of the base of scratch backing
4912  memory being managed by SPI for the queue executing the kernel dispatch.
4913
4914  CP obtains this from the runtime.
4915
4916  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4917  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4918  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4919  memory instructions.
4920
4921  The Scratch Wavefront Offset must also be used as an offset with Private
4922  segment address when using the Scratch Segment Buffer (see
4923  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4924
4925* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4926  specifies *Architected flat scratch*:
4927
4928  If ENABLE_PRIVATE_SEGMENT is enabled in
4929  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH
4930  register pair will be initialized to the 64-bit address of the base of scratch
4931  backing memory being managed by SPI for the queue executing the kernel
4932  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4933  flat scratch base in flat memory instructions.
4934
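The arithmetic of the *Offset flat scratch* and *Absolute flat scratch* methods
can be sketched in Python as follows. This is an illustration of the
description above rather than the code the backend emits. The function names
are hypothetical, wrapping to 32 or 64 bits is assumed to mimic SGPR
arithmetic, and FLAT_SCRATCH_LO is assumed to hold the low half of the 64-bit
value in the absolute method.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the offset method.

    ``init_lo``/``init_hi`` are the two words of Flat Scratch Init.
    """
    # Wavefront's byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID.
    wave_byte_offset = (init_lo + scratch_wavefront_offset) & MASK32
    # The offset is right shifted by 8 before moving into FLAT_SCRATCH_HI.
    flat_scratch_hi = wave_byte_offset >> 8
    # The per work-item scratch size is moved to FLAT_SCRATCH_LO unchanged.
    flat_scratch_lo = init_hi
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(flat_scratch_init, scratch_wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the absolute method."""
    # 64-bit queue scratch base plus the wave's Scratch Wavefront Offset.
    base = (flat_scratch_init + scratch_wavefront_offset) & MASK64
    return base & MASK32, base >> 32
```

For instance, with a Flat Scratch Init low word of ``0x1000``, a size of ``64``
bytes and a Scratch Wavefront Offset of ``0x300``, the offset method yields
FLAT_SCRATCH_HI ``0x13`` and FLAT_SCRATCH_LO ``64``.
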
4935.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4936
4937Private Segment Buffer
4938++++++++++++++++++++++
4939
4940If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4941*Architected flat scratch* then a Private Segment Buffer is not supported.
4942Instead the flat SCRATCH instructions are used.
4943
Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by
the runtime. It is used, together with Scratch Wavefront Offset as an offset,
to access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4949
The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
always selected for the kernel as follows:
4952
4953  - If it is known during instruction selection that there is stack usage,
4954    SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
4955    optimizations are disabled (``-O0``), if stack objects already exist (for
4956    locals, etc.), or if there are any function calls.
4957
4958  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4959    are reserved for the tentative scratch V#. These will be used if it is
4960    determined that spilling is needed.
4961
4962    - If no use is made of the tentative scratch V#, then it is unreserved,
4963      and the register count is determined ignoring it.
4964    - If use is made of the tentative scratch V#, then its register numbers
4965      are shifted to the first four-aligned SGPR index after the highest one
4966      allocated by the register allocator, and all uses are updated. The
4967      register count includes them in the shifted location.
4968    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.
4971
4972    .. note::
4973
4974      This approach of using a tentative scratch V# and shifting the register
4975      numbers if used avoids having to perform register allocation a second
4976      time if the tentative V# is eliminated. This is more efficient and
4977      avoids the problem that the second register allocation may perform
4978      spilling which will fail as there is no longer a scratch V#.
4979
4980When the kernel prolog code is being emitted it is known whether the scratch V#
4981described above is actually used. If it is, the prolog code must set it up by
4982copying the Private Segment Buffer to the scratch V# registers and then adding
4983the Private Segment Wavefront Offset to the queue base address in the V#. The
4984result is a V# with a base address pointing to the beginning of the wavefront
4985scratch backing memory.
4986
4987The Private Segment Buffer is always requested, but the Private Segment
4988Wavefront Offset is only requested if it is used (see
4989:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4990
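The selection of the scratch V# registers described above can be sketched in
Python. The function and parameter names are hypothetical, and the sketch
deliberately does not model processors with the SGPR allocation bug, where the
tentative allocation is left in place.

```python
def final_scratch_vsharp(known_stack_usage, tentative_used,
                         highest_allocated_sgpr):
    """Return the four SGPR indices of the scratch V#, or None if unreserved.

    Illustrative only; assumes a processor without the SGPR allocation bug.
    """
    if known_stack_usage:
        # Stack usage known during instruction selection: SGPR0-3 is reserved.
        return [0, 1, 2, 3]
    if not tentative_used:
        # The tentative V# was never used, so it is unreserved.
        return None
    # Shift to the first four-aligned index after the highest allocated SGPR.
    first = (highest_allocated_sgpr + 4) // 4 * 4
    return [first, first + 1, first + 2, first + 3]
```

If the register allocator's highest SGPR is 9, for example, a used tentative
V# ends up in SGPR12 to SGPR15.
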
4991.. _amdgpu-amdhsa-memory-model:
4992
4993Memory Model
4994~~~~~~~~~~~~
4995
4996This section describes the mapping of the LLVM memory model onto AMDGPU machine
4997code (see :ref:`memmodel`).
4998
4999The AMDGPU backend supports the memory synchronization scopes specified in
5000:ref:`amdgpu-memory-scopes`.
5001
5002The code sequences used to implement the memory model specify the order of
5003instructions that a single thread must execute. The ``s_waitcnt`` and cache
5004management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5005to other memory instructions executed by the same thread. This allows them to be
5006moved earlier or later which can allow them to be combined with other instances
5007of the same instruction, or hoisted/sunk out of loops to improve performance.
5008Only the instructions related to the memory model are given; additional
5009``s_waitcnt`` instructions are required to ensure registers are defined before
being used. It may be possible to combine these with the memory model
``s_waitcnt`` instructions as described above.
5012
5013The AMDGPU backend supports the following memory models:
5014
5015  HSA Memory Model [HSA]_
5016    The HSA memory model uses a single happens-before relation for all address
5017    spaces (see :ref:`amdgpu-address-spaces`).
5018  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space were specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).
5030
5031``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
5032operations.
5033
5034``buffer/global/flat_load/store/atomic`` instructions to global memory are
5035termed vector memory operations.
5036
5037Private address space uses ``buffer_load/store`` using the scratch V#
5038(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5039is accessing the memory, atomic memory orderings are not meaningful, and all
5040accesses are treated as non-atomic.
5041
5042Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores, atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.
5047
5048A memory synchronization scope wider than work-group is not meaningful for the
5049group (LDS) address space and is treated as work-group.
5050
5051The memory model does not support the region address space which is treated as
5052non-atomic.
5053
5054Acquire memory ordering is not meaningful on store atomic instructions and is
5055treated as non-atomic.
5056
5057Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
5059
5060Acquire-release memory ordering is not meaningful on load or store atomic
5061instructions and is treated as acquire and release respectively.
5062
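The normalization rules in the preceding paragraphs can be written down as a
short Python sketch. The function names and the string spellings of
instructions, orderings, and scopes are assumptions chosen for illustration,
and only the plain ``agent`` and ``system`` scopes are modeled as wider than
work-group.

```python
def effective_ordering(instruction, ordering):
    """Return the ordering the memory model actually applies.

    ``instruction`` is "load atomic", "store atomic", or "atomicrmw".
    """
    if instruction == "store atomic":
        if ordering == "acquire":
            return "non-atomic"   # acquire is not meaningful on a store
        if ordering == "acq_rel":
            return "release"      # acq_rel on a store is treated as release
    if instruction == "load atomic":
        if ordering == "release":
            return "non-atomic"   # release is not meaningful on a load
        if ordering == "acq_rel":
            return "acquire"      # acq_rel on a load is treated as acquire
    return ordering


def effective_scope(address_space, scope):
    """Clamp scopes wider than work-group for the group (LDS) address space."""
    if address_space == "local" and scope in ("agent", "system"):
        return "workgroup"
    return scope
```

For example, an ``acq_rel`` load atomic at ``system`` scope on the local
address space is handled as an ``acquire`` at ``workgroup`` scope.
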
5063The memory order also adds the single thread optimization constraints defined in
5064table
5065:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5066
5067  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5068     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5069
5070     ============ ==============================================================
5071     LLVM Memory  Optimization Constraints
5072     Ordering
5073     ============ ==============================================================
5074     unordered    *none*
5075     monotonic    *none*
5076     acquire      - If a load atomic/atomicrmw then no following load/load
5077                    atomic/store/store atomic/atomicrmw/fence instruction can be
5078                    moved before the acquire.
5079                  - If a fence then same as load atomic, plus no preceding
5080                    associated fence-paired-atomic can be moved after the fence.
5081     release      - If a store atomic/atomicrmw then no preceding load/load
5082                    atomic/store/store atomic/atomicrmw/fence instruction can be
5083                    moved after the release.
5084                  - If a fence then same as store atomic, plus no following
5085                    associated fence-paired-atomic can be moved before the
5086                    fence.
5087     acq_rel      Same constraints as both acquire and release.
5088     seq_cst      - If a load atomic then same constraints as acquire, plus no
5089                    preceding sequentially consistent load atomic/store
5090                    atomic/atomicrmw/fence instruction can be moved after the
5091                    seq_cst.
5092                  - If a store atomic then the same constraints as release, plus
5093                    no following sequentially consistent load atomic/store
5094                    atomic/atomicrmw/fence instruction can be moved before the
5095                    seq_cst.
5096                  - If an atomicrmw/fence then same constraints as acq_rel.
5097     ============ ==============================================================
5098
5099The code sequences used to implement the memory model are defined in the
5100following sections:
5101
5102* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5103* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5104* :ref:`amdgpu-amdhsa-memory-model-gfx940`
5105* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5106
5107.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5108
5109Memory Model GFX6-GFX9
5110++++++++++++++++++++++
5111
5112For GFX6-GFX9:
5113
5114* Each agent has multiple shader arrays (SA).
5115* Each SA has multiple compute units (CU).
5116* Each CU has multiple SIMDs that execute wavefronts.
5117* The wavefronts for a single work-group are executed in the same CU but may be
5118  executed by different SIMDs.
5119* Each CU has a single LDS memory shared by the wavefronts of the work-groups
5120  executing on it.
5121* All LDS operations of a CU are performed as wavefront wide operations in a
5122  global order and involve no caching. Completion is reported to a wavefront in
5123  execution order.
5124* The LDS memory has multiple request queues shared by the SIMDs of a
5125  CU. Therefore, the LDS operations performed by different wavefronts of a
5126  work-group can be reordered relative to each other, which can result in
5127  reordering the visibility of vector memory operations with respect to LDS
5128  operations of other wavefronts in the same work-group. A ``s_waitcnt
5129  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5130  vector memory operations between wavefronts of a work-group, but not between
5131  operations performed by the same wavefront.
5132* The vector memory operations are performed as wavefront wide operations and
5133  completion is reported to a wavefront in execution order. The exception is
5134  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5135  vector memory order if they access LDS memory, and out of LDS operation order
5136  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
5143* The scalar memory operations access a scalar L1 cache shared by all wavefronts
5144  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5145  scalar operations are used in a restricted way so do not impact the memory
5146  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5147* The vector and scalar memory operations use an L2 cache shared by all CUs on
5148  the same agent.
5149* The L2 cache has independent channels to service disjoint ranges of virtual
5150  addresses.
5151* Each CU has a separate request queue per channel. Therefore, the vector and
5152  scalar memory operations performed by wavefronts executing in different
5153  work-groups (which may be executing on different CUs) of an agent can be
5154  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5155  ensure synchronization between vector memory operations of different CUs. It
5156  ensures a previous vector memory operation has completed before executing a
5157  subsequent vector memory or LDS operation and so can be used to meet the
5158  requirements of acquire and release.
5159* The L2 cache can be kept coherent with other agents on some targets, or ranges
5160  of virtual addresses can be set up to bypass it to ensure system coherence.
5161
5162Scalar memory operations are only used to access memory that is proven to not
5163change during the execution of the kernel dispatch. This includes constant
5164address space and global address space for program scope ``const`` variables.
5165Therefore, the kernel machine code does not have to maintain the scalar cache to
5166ensure it is coherent with the vector caches. The scalar and vector caches are
5167invalidated between kernel dispatches by CP since constant address space data
5168may change between kernel dispatch executions. See
5169:ref:`amdgpu-amdhsa-memory-spaces`.
5170
5171The one exception is if scalar writes are used to spill SGPR registers. In this
5172case the AMDGPU backend ensures the memory location used to spill is never
5173accessed by vector memory operations at the same time. If scalar writes are used
5174then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5175return since the locations may be used for vector memory instructions by a
5176future wavefront that uses the same scratch area, or a function call that
5177creates a frame at the same address, respectively. There is no need for a
5178``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5179
5180For kernarg backing memory:
5181
5182* CP invalidates the L1 cache at the start of each kernel dispatch.
5183* On dGPU the kernarg backing memory is allocated in host memory accessed as
5184  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5185  causes it to be treated as non-volatile and so is not invalidated by
5186  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
5189
5190Scratch backing memory (which is used for the private address space) is accessed
5191with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5192only accessed by a single thread, and is always write-before-read, there is
5193never a need to invalidate these entries from the L1 cache. Hence all cache
5194invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5195
5196The code sequences used to implement the memory model for GFX6-GFX9 are defined
5197in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5198
5199  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5200     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5201
5202     ============ ============ ============== ========== ================================
5203     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
5204                  Ordering     Sync Scope     Address    GFX6-GFX9
5205                                              Space
5206     ============ ============ ============== ========== ================================
5207     **Non-Atomic**
5208     ------------------------------------------------------------------------------------
5209     load         *none*       *none*         - global   - !volatile & !nontemporal
5210                                              - generic
5211                                              - private    1. buffer/global/flat_load
5212                                              - constant
5213                                                         - !volatile & nontemporal
5214
5215                                                           1. buffer/global/flat_load
5216                                                              glc=1 slc=1
5217
5218                                                         - volatile
5219
5220                                                           1. buffer/global/flat_load
5221                                                              glc=1
5222                                                           2. s_waitcnt vmcnt(0)
5223
5224                                                            - Must happen before
5225                                                              any following volatile
5226                                                              global/generic
5227                                                              load/store.
5228                                                            - Ensures that
5229                                                              volatile
5230                                                              operations to
5231                                                              different
5232                                                              addresses will not
5233                                                              be reordered by
5234                                                              hardware.
5235
5236     load         *none*       *none*         - local    1. ds_load
5237     store        *none*       *none*         - global   - !volatile & !nontemporal
5238                                              - generic
5239                                              - private    1. buffer/global/flat_store
5240                                              - constant
5241                                                         - !volatile & nontemporal
5242
5243                                                           1. buffer/global/flat_store
5244                                                              glc=1 slc=1
5245
5246                                                         - volatile
5247
5248                                                           1. buffer/global/flat_store
5249                                                           2. s_waitcnt vmcnt(0)
5250
5251                                                            - Must happen before
5252                                                              any following volatile
5253                                                              global/generic
5254                                                              load/store.
5255                                                            - Ensures that
5256                                                              volatile
5257                                                              operations to
5258                                                              different
5259                                                              addresses will not
5260                                                              be reordered by
5261                                                              hardware.
5262
5263     store        *none*       *none*         - local    1. ds_store
5264     **Unordered Atomic**
5265     ------------------------------------------------------------------------------------
5266     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
5267     store atomic unordered    *any*          *any*      *Same as non-atomic*.
5268     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
5269     **Monotonic Atomic**
5270     ------------------------------------------------------------------------------------
5271     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
5272                               - wavefront    - local
5273                               - workgroup    - generic
5274     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
5275                               - system       - generic     glc=1
5276     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
5277                               - wavefront    - generic
5278                               - workgroup
5279                               - agent
5280                               - system
5281     store atomic monotonic    - singlethread - local    1. ds_store
5282                               - wavefront
5283                               - workgroup
5284     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
5285                               - wavefront    - generic
5286                               - workgroup
5287                               - agent
5288                               - system
5289     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
5290                               - wavefront
5291                               - workgroup
5292     **Acquire Atomic**
5293     ------------------------------------------------------------------------------------
5294     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
5295                               - wavefront    - local
5296                                              - generic
5297     load atomic  acquire      - workgroup    - global   1. buffer/global_load
5298     load atomic  acquire      - workgroup    - local    1. ds/flat_load
5299                                              - generic  2. s_waitcnt lgkmcnt(0)
5300
5301                                                           - If OpenCL, omit.
5302                                                           - Must happen before
5303                                                             any following
5304                                                             global/generic
5305                                                             load/load
5306                                                             atomic/store/store
5307                                                             atomic/atomicrmw.
5308                                                           - Ensures any
5309                                                             following global
5310                                                             data read is no
5311                                                             older than a local load
5312                                                             atomic value being
5313                                                             acquired.
5314
5315     load atomic  acquire      - agent        - global   1. buffer/global_load
5316                               - system                     glc=1
5317                                                         2. s_waitcnt vmcnt(0)
5318
5319                                                           - Must happen before
5320                                                             following
5321                                                             buffer_wbinvl1_vol.
5322                                                           - Ensures the load
5323                                                             has completed
5324                                                             before invalidating
5325                                                             the cache.
5326
5327                                                         3. buffer_wbinvl1_vol
5328
5329                                                           - Must happen before
5330                                                             any following
5331                                                             global/generic
5332                                                             load/load
5333                                                             atomic/atomicrmw.
5334                                                           - Ensures that
5335                                                             following
5336                                                             loads will not see
5337                                                             stale global data.
5338
5339     load atomic  acquire      - agent        - generic  1. flat_load glc=1
5340                               - system                  2. s_waitcnt vmcnt(0) &
5341                                                            lgkmcnt(0)
5342
5343                                                           - If OpenCL, omit
5344                                                             lgkmcnt(0).
5345                                                           - Must happen before
5346                                                             following
5347                                                             buffer_wbinvl1_vol.
5348                                                           - Ensures the flat_load
5349                                                             has completed
5350                                                             before invalidating
5351                                                             the cache.
5352
5353                                                         3. buffer_wbinvl1_vol
5354
5355                                                           - Must happen before
5356                                                             any following
5357                                                             global/generic
5358                                                             load/load
5359                                                             atomic/atomicrmw.
5360                                                           - Ensures that
5361                                                             following loads
5362                                                             will not see stale
5363                                                             global data.
5364
5365     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5366                               - wavefront    - local
5367                                              - generic
5368     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5369     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5370                                              - generic  2. s_waitcnt lgkmcnt(0)
5371
5372                                                           - If OpenCL, omit.
5373                                                           - Must happen before
5374                                                             any following
5375                                                             global/generic
5376                                                             load/load
5377                                                             atomic/store/store
5378                                                             atomic/atomicrmw.
5379                                                           - Ensures any
5380                                                             following global
5381                                                             data read is no
5382                                                             older than a local
5383                                                             atomicrmw value
5384                                                             being acquired.
5385
5386     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5387                               - system                  2. s_waitcnt vmcnt(0)
5388
5389                                                           - Must happen before
5390                                                             following
5391                                                             buffer_wbinvl1_vol.
5392                                                           - Ensures the
5393                                                             atomicrmw has
5394                                                             completed before
5395                                                             invalidating the
5396                                                             cache.
5397
5398                                                         3. buffer_wbinvl1_vol
5399
5400                                                           - Must happen before
5401                                                             any following
5402                                                             global/generic
5403                                                             load/load
5404                                                             atomic/atomicrmw.
5405                                                           - Ensures that
5406                                                             following loads
5407                                                             will not see stale
5408                                                             global data.
5409
5410     atomicrmw    acquire      - agent        - generic  1. flat_atomic
5411                               - system                  2. s_waitcnt vmcnt(0) &
5412                                                            lgkmcnt(0)
5413
5414                                                           - If OpenCL, omit
5415                                                             lgkmcnt(0).
5416                                                           - Must happen before
5417                                                             following
5418                                                             buffer_wbinvl1_vol.
5419                                                           - Ensures the
5420                                                             atomicrmw has
5421                                                             completed before
5422                                                             invalidating the
5423                                                             cache.
5424
5425                                                         3. buffer_wbinvl1_vol
5426
5427                                                           - Must happen before
5428                                                             any following
5429                                                             global/generic
5430                                                             load/load
5431                                                             atomic/atomicrmw.
5432                                                           - Ensures that
5433                                                             following loads
5434                                                             will not see stale
5435                                                             global data.
5436
5437     fence        acquire      - singlethread *none*     *none*
5438                               - wavefront
5439     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5440
5441                                                           - If OpenCL and
5442                                                             address space is
5443                                                             not generic, omit.
5444                                                           - However, since LLVM
5445                                                             currently has no
5446                                                             address space on
5447                                                             the fence, it must
5448                                                             conservatively
5449                                                             always be generated.
5450                                                             If the fence had an
5451                                                             address space, it
5452                                                             would be set to the
5453                                                             address space of
5454                                                             the OpenCL fence
5455                                                             flag, or to generic
5456                                                             if both local and
5457                                                             global flags are
5458                                                             specified.
5459                                                           - Must happen after
5460                                                             any preceding
5461                                                             local/generic load
5462                                                             atomic/atomicrmw
5463                                                             with an equal or
5464                                                             wider sync scope
5465                                                             and memory ordering
5466                                                             stronger than
5467                                                             unordered (this is
5468                                                             termed the
5469                                                             fence-paired-atomic).
5470                                                           - Must happen before
5471                                                             any following
5472                                                             global/generic
5473                                                             load/load
5474                                                             atomic/store/store
5475                                                             atomic/atomicrmw.
5476                                                           - Ensures any
5477                                                             following global
5478                                                             data read is no
5479                                                             older than the
5480                                                             value read by the
5481                                                             fence-paired-atomic.
5482
5483     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5484                               - system                     vmcnt(0)
5485
5486                                                           - If OpenCL and
5487                                                             address space is
5488                                                             not generic, omit
5489                                                             lgkmcnt(0).
5490                                                           - However, since LLVM
5491                                                             currently has no
5492                                                             address space on
5493                                                             the fence, it must
5494                                                             conservatively
5495                                                             always be generated
5496                                                             (see comment for
5497                                                             previous fence).
5498                                                           - Could be split into
5499                                                             separate s_waitcnt
5500                                                             vmcnt(0) and
5501                                                             s_waitcnt
5502                                                             lgkmcnt(0) to allow
5503                                                             them to be
5504                                                             independently moved
5505                                                             according to the
5506                                                             following rules.
5507                                                           - s_waitcnt vmcnt(0)
5508                                                             must happen after
5509                                                             any preceding
5510                                                             global/generic load
5511                                                             atomic/atomicrmw
5512                                                             with an equal or
5513                                                             wider sync scope
5514                                                             and memory ordering
5515                                                             stronger than
5516                                                             unordered (this is
5517                                                             termed the
5518                                                             fence-paired-atomic).
5519                                                           - s_waitcnt lgkmcnt(0)
5520                                                             must happen after
5521                                                             any preceding
5522                                                             local/generic load
5523                                                             atomic/atomicrmw
5524                                                             with an equal or
5525                                                             wider sync scope
5526                                                             and memory ordering
5527                                                             stronger than
5528                                                             unordered (this is
5529                                                             termed the
5530                                                             fence-paired-atomic).
5531                                                           - Must happen before
5532                                                             the following
5533                                                             buffer_wbinvl1_vol.
5534                                                           - Ensures that the
5535                                                             fence-paired-atomic
5536                                                             has completed
5537                                                             before invalidating
5538                                                             the
5539                                                             cache. Therefore
5540                                                             any following
5541                                                             locations read must
5542                                                             be no older than
5543                                                             the value read by
5544                                                             the
5545                                                             fence-paired-atomic.
5546
5547                                                         2. buffer_wbinvl1_vol
5548
5549                                                           - Must happen before any
5550                                                             following global/generic
5551                                                             load/load
5552                                                             atomic/store/store
5553                                                             atomic/atomicrmw.
5554                                                           - Ensures that
5555                                                             following loads
5556                                                             will not see stale
5557                                                             global data.
5558
5559     **Release Atomic**
5560     ------------------------------------------------------------------------------------
5561     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5562                               - wavefront    - local
5563                                              - generic
5564     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5565                                              - generic
5566                                                           - If OpenCL, omit.
5567                                                           - Must happen after
5568                                                             any preceding
5569                                                             local/generic
5570                                                             load/store/load
5571                                                             atomic/store
5572                                                             atomic/atomicrmw.
5573                                                           - Must happen before
5574                                                             the following
5575                                                             store.
5576                                                           - Ensures that all
5577                                                             memory operations
5578                                                             to local have
5579                                                             completed before
5580                                                             performing the
5581                                                             store that is being
5582                                                             released.
5583
5584                                                         2. buffer/global/flat_store
5585     store atomic release      - workgroup    - local    1. ds_store
5586     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5587                               - system       - generic     vmcnt(0)
5588
5589                                                           - If OpenCL and
5590                                                             address space is
5591                                                             not generic, omit
5592                                                             lgkmcnt(0).
5593                                                           - Could be split into
5594                                                             separate s_waitcnt
5595                                                             vmcnt(0) and
5596                                                             s_waitcnt
5597                                                             lgkmcnt(0) to allow
5598                                                             them to be
5599                                                             independently moved
5600                                                             according to the
5601                                                             following rules.
5602                                                           - s_waitcnt vmcnt(0)
5603                                                             must happen after
5604                                                             any preceding
5605                                                             global/generic
5606                                                             load/store/load
5607                                                             atomic/store
5608                                                             atomic/atomicrmw.
5609                                                           - s_waitcnt lgkmcnt(0)
5610                                                             must happen after
5611                                                             any preceding
5612                                                             local/generic
5613                                                             load/store/load
5614                                                             atomic/store
5615                                                             atomic/atomicrmw.
5616                                                           - Must happen before
5617                                                             the following
5618                                                             store.
5619                                                           - Ensures that all
5620                                                             memory operations
5621                                                             to global and local
5622                                                             have completed
5623                                                             before performing
5624                                                             the store that is
5625                                                             being released.
5626
5627                                                         2. buffer/global/flat_store
5628     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5629                               - wavefront    - local
5630                                              - generic
5631     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5632                                              - generic
5633                                                           - If OpenCL, omit.
5634                                                           - Must happen after
5635                                                             any preceding
5636                                                             local/generic
5637                                                             load/store/load
5638                                                             atomic/store
5639                                                             atomic/atomicrmw.
5640                                                           - Must happen before
5641                                                             the following
5642                                                             atomicrmw.
5643                                                           - Ensures that all
5644                                                             memory operations
5645                                                             to local have
5646                                                             completed before
5647                                                             performing the
5648                                                             atomicrmw that is
5649                                                             being released.
5650
5651                                                         2. buffer/global/flat_atomic
5652     atomicrmw    release      - workgroup    - local    1. ds_atomic
5653     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5654                               - system       - generic     vmcnt(0)
5655
5656                                                           - If OpenCL, omit
5657                                                             lgkmcnt(0).
5658                                                           - Could be split into
5659                                                             separate s_waitcnt
5660                                                             vmcnt(0) and
5661                                                             s_waitcnt
5662                                                             lgkmcnt(0) to allow
5663                                                             them to be
5664                                                             independently moved
5665                                                             according to the
5666                                                             following rules.
5667                                                           - s_waitcnt vmcnt(0)
5668                                                             must happen after
5669                                                             any preceding
5670                                                             global/generic
5671                                                             load/store/load
5672                                                             atomic/store
5673                                                             atomic/atomicrmw.
5674                                                           - s_waitcnt lgkmcnt(0)
5675                                                             must happen after
5676                                                             any preceding
5677                                                             local/generic
5678                                                             load/store/load
5679                                                             atomic/store
5680                                                             atomic/atomicrmw.
5681                                                           - Must happen before
5682                                                             the following
5683                                                             atomicrmw.
5684                                                           - Ensures that all
5685                                                             memory operations
5686                                                             to global and local
5687                                                             have completed
5688                                                             before performing
5689                                                             the atomicrmw that
5690                                                             is being released.
5691
5692                                                         2. buffer/global/flat_atomic
5693     fence        release      - singlethread *none*     *none*
5694                               - wavefront
5695     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5696
5697                                                           - If OpenCL and
5698                                                             address space is
5699                                                             not generic, omit.
5700                                                           - However, since LLVM
5701                                                             currently has no
5702                                                             address space on
5703                                                             the fence, this wait
5704                                                             must conservatively
5705                                                             always be generated.
5706                                                             If the fence had an
5707                                                             address space, it
5708                                                             would be set to the
5709                                                             address space of the
5710                                                             OpenCL fence flag, or
5711                                                             to generic if both
5712                                                             local and global
5713                                                             flags are
5714                                                             specified.
5715                                                           - Must happen after
5716                                                             any preceding
5717                                                             local/generic
5718                                                             load/load
5719                                                             atomic/store/store
5720                                                             atomic/atomicrmw.
5721                                                           - Must happen before
5722                                                             any following store
5723                                                             atomic/atomicrmw
5724                                                             with an equal or
5725                                                             wider sync scope
5726                                                             and memory ordering
5727                                                             stronger than
5728                                                             unordered (this is
5729                                                             termed the
5730                                                             fence-paired-atomic).
5731                                                           - Ensures that all
5732                                                             memory operations
5733                                                             to local have
5734                                                             completed before
5735                                                             performing the
5736                                                             following
5737                                                             fence-paired-atomic.
5738
5739     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5740                               - system                     vmcnt(0)
5741
5742                                                           - If OpenCL and
5743                                                             address space is
5744                                                             not generic, omit
5745                                                             lgkmcnt(0).
5746                                                           - If OpenCL and
5747                                                             address space is
5748                                                             local, omit
5749                                                             vmcnt(0).
5750                                                           - However, since LLVM
5751                                                             currently has no
5752                                                             address space on
5753                                                             the fence, this wait
5754                                                             must conservatively
5755                                                             always be generated.
5756                                                             If the fence had an
5757                                                             address space, it
5758                                                             would be set to the
5759                                                             address space of the
5760                                                             OpenCL fence flag, or
5761                                                             to generic if both
5762                                                             local and global
5763                                                             flags are
5764                                                             specified.
5765                                                           - Could be split into
5766                                                             separate s_waitcnt
5767                                                             vmcnt(0) and
5768                                                             s_waitcnt
5769                                                             lgkmcnt(0) to allow
5770                                                             them to be
5771                                                             independently moved
5772                                                             according to the
5773                                                             following rules.
5774                                                           - s_waitcnt vmcnt(0)
5775                                                             must happen after
5776                                                             any preceding
5777                                                             global/generic
5778                                                             load/store/load
5779                                                             atomic/store
5780                                                             atomic/atomicrmw.
5781                                                           - s_waitcnt lgkmcnt(0)
5782                                                             must happen after
5783                                                             any preceding
5784                                                             local/generic
5785                                                             load/store/load
5786                                                             atomic/store
5787                                                             atomic/atomicrmw.
5788                                                           - Must happen before
5789                                                             any following store
5790                                                             atomic/atomicrmw
5791                                                             with an equal or
5792                                                             wider sync scope
5793                                                             and memory ordering
5794                                                             stronger than
5795                                                             unordered (this is
5796                                                             termed the
5797                                                             fence-paired-atomic).
5798                                                           - Ensures that all
5799                                                             memory operations
5800                                                             have
5801                                                             completed before
5802                                                             performing the
5803                                                             following
5804                                                             fence-paired-atomic.
5805
5806     **Acquire-Release Atomic**
5807     ------------------------------------------------------------------------------------
5808     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5809                               - wavefront    - local
5810                                              - generic
5811     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5812
5813                                                           - If OpenCL, omit.
5814                                                           - Must happen after
5815                                                             any preceding
5816                                                             local/generic
5817                                                             load/store/load
5818                                                             atomic/store
5819                                                             atomic/atomicrmw.
5820                                                           - Must happen before
5821                                                             the following
5822                                                             atomicrmw.
5823                                                           - Ensures that all
5824                                                             memory operations
5825                                                             to local have
5826                                                             completed before
5827                                                             performing the
5828                                                             atomicrmw that is
5829                                                             being released.
5830
5831                                                         2. buffer/global_atomic
5832
5833     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5834                                                         2. s_waitcnt lgkmcnt(0)
5835
5836                                                           - If OpenCL, omit.
5837                                                           - Must happen before
5838                                                             any following
5839                                                             global/generic
5840                                                             load/load
5841                                                             atomic/store/store
5842                                                             atomic/atomicrmw.
5843                                                           - Ensures any
5844                                                             following global
5845                                                             data read is no
5846                                                             older than the local
5847                                                             atomicrmw value being
5848                                                             acquired.
5849
5850     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5851
5852                                                           - If OpenCL, omit.
5853                                                           - Must happen after
5854                                                             any preceding
5855                                                             local/generic
5856                                                             load/store/load
5857                                                             atomic/store
5858                                                             atomic/atomicrmw.
5859                                                           - Must happen before
5860                                                             the following
5861                                                             atomicrmw.
5862                                                           - Ensures that all
5863                                                             memory operations
5864                                                             to local have
5865                                                             completed before
5866                                                             performing the
5867                                                             atomicrmw that is
5868                                                             being released.
5869
5870                                                         2. flat_atomic
5871                                                         3. s_waitcnt lgkmcnt(0)
5872
5873                                                           - If OpenCL, omit.
5874                                                           - Must happen before
5875                                                             any following
5876                                                             global/generic
5877                                                             load/load
5878                                                             atomic/store/store
5879                                                             atomic/atomicrmw.
5880                                                           - Ensures any
5881                                                             following global
5882                                                             data read is no
5883                                                             older than a local
5884                                                             atomicrmw value being
5885                                                             acquired.
5886
5887     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5888                               - system                     vmcnt(0)
5889
5890                                                           - If OpenCL, omit
5891                                                             lgkmcnt(0).
5892                                                           - Could be split into
5893                                                             separate s_waitcnt
5894                                                             vmcnt(0) and
5895                                                             s_waitcnt
5896                                                             lgkmcnt(0) to allow
5897                                                             them to be
5898                                                             independently moved
5899                                                             according to the
5900                                                             following rules.
5901                                                           - s_waitcnt vmcnt(0)
5902                                                             must happen after
5903                                                             any preceding
5904                                                             global/generic
5905                                                             load/store/load
5906                                                             atomic/store
5907                                                             atomic/atomicrmw.
5908                                                           - s_waitcnt lgkmcnt(0)
5909                                                             must happen after
5910                                                             any preceding
5911                                                             local/generic
5912                                                             load/store/load
5913                                                             atomic/store
5914                                                             atomic/atomicrmw.
5915                                                           - Must happen before
5916                                                             the following
5917                                                             atomicrmw.
5918                                                           - Ensures that all
5919                                                             memory operations
5920                                                             to global have
5921                                                             completed before
5922                                                             performing the
5923                                                             atomicrmw that is
5924                                                             being released.
5925
5926                                                         2. buffer/global_atomic
5927                                                         3. s_waitcnt vmcnt(0)
5928
5929                                                           - Must happen before
5930                                                             the following
5931                                                             buffer_wbinvl1_vol.
5932                                                           - Ensures the
5933                                                             atomicrmw has
5934                                                             completed before
5935                                                             invalidating the
5936                                                             cache.
5937
5938                                                         4. buffer_wbinvl1_vol
5939
5940                                                           - Must happen before
5941                                                             any following
5942                                                             global/generic
5943                                                             load/load
5944                                                             atomic/atomicrmw.
5945                                                           - Ensures that
5946                                                             following loads
5947                                                             will not see stale
5948                                                             global data.
5949
5950     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
5951                               - system                     vmcnt(0)
5952
5953                                                           - If OpenCL, omit
5954                                                             lgkmcnt(0).
5955                                                           - Could be split into
5956                                                             separate s_waitcnt
5957                                                             vmcnt(0) and
5958                                                             s_waitcnt
5959                                                             lgkmcnt(0) to allow
5960                                                             them to be
5961                                                             independently moved
5962                                                             according to the
5963                                                             following rules.
5964                                                           - s_waitcnt vmcnt(0)
5965                                                             must happen after
5966                                                             any preceding
5967                                                             global/generic
5968                                                             load/store/load
5969                                                             atomic/store
5970                                                             atomic/atomicrmw.
5971                                                           - s_waitcnt lgkmcnt(0)
5972                                                             must happen after
5973                                                             any preceding
5974                                                             local/generic
5975                                                             load/store/load
5976                                                             atomic/store
5977                                                             atomic/atomicrmw.
5978                                                           - Must happen before
5979                                                             the following
5980                                                             atomicrmw.
5981                                                           - Ensures that all
5982                                                             memory operations
5983                                                             to global have
5984                                                             completed before
5985                                                             performing the
5986                                                             atomicrmw that is
5987                                                             being released.
5988
5989                                                         2. flat_atomic
5990                                                         3. s_waitcnt vmcnt(0) &
5991                                                            lgkmcnt(0)
5992
5993                                                           - If OpenCL, omit
5994                                                             lgkmcnt(0).
5995                                                           - Must happen before
5996                                                             the following
5997                                                             buffer_wbinvl1_vol.
5998                                                           - Ensures the
5999                                                             atomicrmw has
6000                                                             completed before
6001                                                             invalidating the
6002                                                             cache.
6003
6004                                                         4. buffer_wbinvl1_vol
6005
6006                                                           - Must happen before
6007                                                             any following
6008                                                             global/generic
6009                                                             load/load
6010                                                             atomic/atomicrmw.
6011                                                           - Ensures that
6012                                                             following loads
6013                                                             will not see stale
6014                                                             global data.
6015
6016     fence        acq_rel      - singlethread *none*     *none*
6017                               - wavefront
6018     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
6019
6020                                                           - If OpenCL and
6021                                                             address space is
6022                                                             not generic, omit.
6023                                                           - However,
6024                                                             since LLVM
6025                                                             currently has no
6026                                                             address space on
6027                                                             the fence, this wait
6028                                                             must conservatively
6029                                                             always be generated
6030                                                             (see the comment for
6031                                                             the release fence).
6032                                                           - Must happen after
6033                                                             any preceding
6034                                                             local/generic
6035                                                             load/load
6036                                                             atomic/store/store
6037                                                             atomic/atomicrmw.
6038                                                           - Must happen before
6039                                                             any following
6040                                                             global/generic
6041                                                             load/load
6042                                                             atomic/store/store
6043                                                             atomic/atomicrmw.
6044                                                           - Ensures that all
6045                                                             memory operations
6046                                                             to local have
6047                                                             completed before
6048                                                             performing any
6049                                                             following global
6050                                                             memory operations.
6051                                                           - Ensures that the
6052                                                             preceding
6053                                                             local/generic load
6054                                                             atomic/atomicrmw
6055                                                             with an equal or
6056                                                             wider sync scope
6057                                                             and memory ordering
6058                                                             stronger than
6059                                                             unordered (this is
6060                                                             termed the
6061                                                             acquire-fence-paired-atomic)
6062                                                             has completed
6063                                                             before following
6064                                                             global memory
6065                                                             operations. This
6066                                                             satisfies the
6067                                                             requirements of
6068                                                             acquire.
6069                                                           - Ensures that all
6070                                                             previous memory
6071                                                             operations have
6072                                                             completed before a
6073                                                             following
6074                                                             local/generic store
6075                                                             atomic/atomicrmw
6076                                                             with an equal or
6077                                                             wider sync scope
6078                                                             and memory ordering
6079                                                             stronger than
6080                                                             unordered (this is
6081                                                             termed the
6082                                                             release-fence-paired-atomic).
6083                                                             This satisfies the
6084                                                             requirements of
6085                                                             release.
6086
6087     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6088                               - system                     vmcnt(0)
6089
6090                                                           - If OpenCL and
6091                                                             address space is
6092                                                             not generic, omit
6093                                                             lgkmcnt(0).
6094                                                           - However, since LLVM
6095                                                             currently has no
6096                                                             address space on
6097                                                             the fence, this wait
6098                                                             must conservatively
6099                                                             always be generated
6100                                                             (see the comment for
6101                                                             the release fence).
6102                                                           - Could be split into
6103                                                             separate s_waitcnt
6104                                                             vmcnt(0) and
6105                                                             s_waitcnt
6106                                                             lgkmcnt(0) to allow
6107                                                             them to be
6108                                                             independently moved
6109                                                             according to the
6110                                                             following rules.
6111                                                           - s_waitcnt vmcnt(0)
6112                                                             must happen after
6113                                                             any preceding
6114                                                             global/generic
6115                                                             load/store/load
6116                                                             atomic/store
6117                                                             atomic/atomicrmw.
6118                                                           - s_waitcnt lgkmcnt(0)
6119                                                             must happen after
6120                                                             any preceding
6121                                                             local/generic
6122                                                             load/store/load
6123                                                             atomic/store
6124                                                             atomic/atomicrmw.
6125                                                           - Must happen before
6126                                                             the following
6127                                                             buffer_wbinvl1_vol.
6128                                                           - Ensures that the
6129                                                             preceding
6130                                                             global/local/generic
6131                                                             load
6132                                                             atomic/atomicrmw
6133                                                             with an equal or
6134                                                             wider sync scope
6135                                                             and memory ordering
6136                                                             stronger than
6137                                                             unordered (this is
6138                                                             termed the
6139                                                             acquire-fence-paired-atomic)
6140                                                             has completed
6141                                                             before invalidating
6142                                                             the cache. This
6143                                                             satisfies the
6144                                                             requirements of
6145                                                             acquire.
6146                                                           - Ensures that all
6147                                                             previous memory
6148                                                             operations have
6149                                                             completed before a
6150                                                             following
6151                                                             global/local/generic
6152                                                             store
6153                                                             atomic/atomicrmw
6154                                                             with an equal or
6155                                                             wider sync scope
6156                                                             and memory ordering
6157                                                             stronger than
6158                                                             unordered (this is
6159                                                             termed the
6160                                                             release-fence-paired-atomic).
6161                                                             This satisfies the
6162                                                             requirements of
6163                                                             release.
6164
6165                                                         2. buffer_wbinvl1_vol
6166
6167                                                           - Must happen before
6168                                                             any following
6169                                                             global/generic
6170                                                             load/load
6171                                                             atomic/store/store
6172                                                             atomic/atomicrmw.
6173                                                           - Ensures that
6174                                                             following loads
6175                                                             will not see stale
6176                                                             global data. This
6177                                                             satisfies the
6178                                                             requirements of
6179                                                             acquire.
6180
6181     **Sequential Consistent Atomic**
6182     ------------------------------------------------------------------------------------
6183     load atomic  seq_cst      - singlethread - global   *Same as corresponding
6184                               - wavefront    - local    load atomic acquire,
6185                                              - generic  except must generate
6186                                                         all instructions even
6187                                                         for OpenCL.*
6188     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6189                                              - generic
6190
6191                                                           - Must
6192                                                             happen after
6193                                                             preceding
6194                                                             local/generic load
6195                                                             atomic/store
6196                                                             atomic/atomicrmw
6197                                                             with memory
6198                                                             ordering of seq_cst
6199                                                             and with equal or
6200                                                             wider sync scope.
6201                                                             (Note that seq_cst
6202                                                             fences have their
6203                                                             own s_waitcnt
6204                                                             lgkmcnt(0) and so do
6205                                                             not need to be
6206                                                             considered.)
6207                                                           - Ensures any
6208                                                             preceding
6209                                                             sequential
6210                                                             consistent local
6211                                                             memory instructions
6212                                                             have completed
6213                                                             before executing
6214                                                             this sequentially
6215                                                             consistent
6216                                                             instruction. This
6217                                                             prevents reordering
6218                                                             a seq_cst store
6219                                                             followed by a
6220                                                             seq_cst load. (Note
6221                                                             that seq_cst is
6222                                                             stronger than
6223                                                             acquire/release as
6224                                                             the reordering of
6225                                                             load acquire
6226                                                             followed by a store
6227                                                             release is
6228                                                             prevented by the
6229                                                             s_waitcnt of
6230                                                             the release, but
6231                                                             there is nothing
6232                                                             preventing a store
6233                                                             release followed by
6234                                                             load acquire from
6235                                                             completing out of
6236                                                             order. The s_waitcnt
6237                                                             could be placed after
6238                                                             seq_store or before
6239                                                             the seq_load. We
6240                                                             choose the load to
6241                                                             make the s_waitcnt be
6242                                                             as late as possible
6243                                                             so that the store
6244                                                             may have already
6245                                                             completed.)
6246
6247                                                         2. *Following
6248                                                            instructions same as
6249                                                            corresponding load
6250                                                            atomic acquire,
6251                                                            except must generate
6252                                                            all instructions even
6253                                                            for OpenCL.*
6254     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6255                                                         load atomic acquire,
6256                                                         except must generate
6257                                                         all instructions even
6258                                                         for OpenCL.*
6259
6260     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6261                               - system       - generic     vmcnt(0)
6262
6263                                                           - Could be split into
6264                                                             separate s_waitcnt
6265                                                             vmcnt(0)
6266                                                             and s_waitcnt
6267                                                             lgkmcnt(0) to allow
6268                                                             them to be
6269                                                             independently moved
6270                                                             according to the
6271                                                             following rules.
6272                                                           - s_waitcnt lgkmcnt(0)
6273                                                             must happen after
6274                                                             preceding
6275                                                             global/generic load
6276                                                             atomic/store
6277                                                             atomic/atomicrmw
6278                                                             with memory
6279                                                             ordering of seq_cst
6280                                                             and with equal or
6281                                                             wider sync scope.
6282                                                             (Note that seq_cst
6283                                                             fences have their
6284                                                             own s_waitcnt
6285                                                             lgkmcnt(0) and so do
6286                                                             not need to be
6287                                                             considered.)
6288                                                           - s_waitcnt vmcnt(0)
6289                                                             must happen after
6290                                                             preceding
6291                                                             global/generic load
6292                                                             atomic/store
6293                                                             atomic/atomicrmw
6294                                                             with memory
6295                                                             ordering of seq_cst
6296                                                             and with equal or
6297                                                             wider sync scope.
6298                                                             (Note that seq_cst
6299                                                             fences have their
6300                                                             own s_waitcnt
6301                                                             vmcnt(0) and so do
6302                                                             not need to be
6303                                                             considered.)
6304                                                           - Ensures any
6305                                                             preceding
6306                                                             sequential
6307                                                             consistent global
6308                                                             memory instructions
6309                                                             have completed
6310                                                             before executing
6311                                                             this sequentially
6312                                                             consistent
6313                                                             instruction. This
6314                                                             prevents reordering
6315                                                             a seq_cst store
6316                                                             followed by a
6317                                                             seq_cst load. (Note
6318                                                             that seq_cst is
6319                                                             stronger than
6320                                                             acquire/release as
6321                                                             the reordering of
6322                                                             load acquire
6323                                                             followed by a store
6324                                                             release is
6325                                                             prevented by the
6326                                                             s_waitcnt of
6327                                                             the release, but
6328                                                             there is nothing
6329                                                             preventing a store
6330                                                             release followed by
6331                                                             load acquire from
6332                                                             completing out of
6333                                                             order. The s_waitcnt
6334                                                             could be placed after
6335                                                             seq_store or before
6336                                                             the seq_load. We
6337                                                             choose the load to
6338                                                             make the s_waitcnt be
6339                                                             as late as possible
6340                                                             so that the store
6341                                                             may have already
6342                                                             completed.)
6343
6344                                                         2. *Following
6345                                                            instructions same as
6346                                                            corresponding load
6347                                                            atomic acquire,
6348                                                            except must generate
6349                                                            all instructions even
6350                                                            for OpenCL.*
6351     store atomic seq_cst      - singlethread - global   *Same as corresponding
6352                               - wavefront    - local    store atomic release,
6353                               - workgroup    - generic  except must generate
6354                               - agent                   all instructions even
6355                               - system                  for OpenCL.*
6356     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6357                               - wavefront    - local    atomicrmw acq_rel,
6358                               - workgroup    - generic  except must generate
6359                               - agent                   all instructions even
6360                               - system                  for OpenCL.*
6361     fence        seq_cst      - singlethread *none*     *Same as corresponding
6362                               - wavefront               fence acq_rel,
6363                               - workgroup               except must generate
6364                               - agent                   all instructions even
6365                               - system                  for OpenCL.*
6366     ============ ============ ============== ========== ================================
6367
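The ``load atomic seq_cst`` rows of the table above can be read as a small
lookup from sync scope and address space to the extra wait that precedes the
corresponding load atomic acquire sequence. The following Python sketch is
illustrative only: the function name ``seq_cst_load_prefix`` and the string
encodings are hypothetical and are not part of LLVM. Combinations that the
table does not list are rejected rather than guessed.

```python
# Illustrative model of the "load atomic seq_cst" rows of the GFX6-GFX9 table.
# All names here are hypothetical; this is not LLVM code.

SAME_AS_ACQUIRE = None  # no extra leading instruction is required


def seq_cst_load_prefix(sync_scope, address_space):
    """Return the s_waitcnt placed before the load atomic acquire sequence."""
    if sync_scope in ("singlethread", "wavefront"):
        if address_space in ("global", "local", "generic"):
            return SAME_AS_ACQUIRE
    elif sync_scope == "workgroup":
        if address_space in ("global", "generic"):
            return "s_waitcnt lgkmcnt(0)"
        if address_space == "local":
            return SAME_AS_ACQUIRE
    elif sync_scope in ("agent", "system"):
        if address_space in ("global", "generic"):
            return "s_waitcnt lgkmcnt(0) & vmcnt(0)"
    # Combinations the table does not list are rejected rather than guessed.
    raise ValueError(f"not listed: {sync_scope}, {address_space}")
```

For example, ``seq_cst_load_prefix("workgroup", "global")`` returns
``"s_waitcnt lgkmcnt(0)"``, while ``seq_cst_load_prefix("workgroup", "local")``
returns ``None`` because that row is the same as load atomic acquire.
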
6368.. _amdgpu-amdhsa-memory-model-gfx90a:
6369
6370Memory Model GFX90A
6371+++++++++++++++++++
6372
6373For GFX90A:
6374
6375* Each agent has multiple shader arrays (SA).
6376* Each SA has multiple compute units (CU).
6377* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in which
  the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated because wavefronts of the same work-group can be in different CUs.
6384* All LDS operations of a CU are performed as wavefront wide operations in a
6385  global order and involve no caching. Completion is reported to a wavefront in
6386  execution order.
6387* The LDS memory has multiple request queues shared by the SIMDs of a
6388  CU. Therefore, the LDS operations performed by different wavefronts of a
6389  work-group can be reordered relative to each other, which can result in
6390  reordering the visibility of vector memory operations with respect to LDS
6391  operations of other wavefronts in the same work-group. A ``s_waitcnt
6392  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6393  vector memory operations between wavefronts of a work-group, but not between
6394  operations performed by the same wavefront.
6395* The vector memory operations are performed as wavefront wide operations and
6396  completion is reported to a wavefront in execution order. The exception is
6397  that ``flat_load/store/atomic`` instructions can report out of vector memory
6398  order if they access LDS memory, and out of LDS operation order if they access
6399  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:
6402
6403  * No special action is required for coherence between the lanes of a single
6404    wavefront.
6405
6406  * No special action is required for coherence between wavefronts in the same
6407    work-group since they execute on the same CU. The exception is when in
6408    tgsplit execution mode as wavefronts of the same work-group can be in
6409    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6410    the following item.
6411
6412  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6413    executing in different work-groups as they may be executing on different
6414    CUs.
6415
6416* The scalar memory operations access a scalar L1 cache shared by all wavefronts
6417  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6418  scalar operations are used in a restricted way so do not impact the memory
6419  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6420* The vector and scalar memory operations use an L2 cache shared by all CUs on
6421  the same agent.
6422
6423  * The L2 cache has independent channels to service disjoint ranges of virtual
6424    addresses.
6425  * Each CU has a separate request queue per channel. Therefore, the vector and
6426    scalar memory operations performed by wavefronts executing in different
6427    work-groups (which may be executing on different CUs), or the same
6428    work-group if executing in tgsplit mode, of an agent can be reordered
6429    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6430    synchronization between vector memory operations of different CUs. It
6431    ensures a previous vector memory operation has completed before executing a
6432    subsequent vector memory or LDS operation and so can be used to meet the
6433    requirements of acquire and release.
6434  * The L2 cache of one agent can be kept coherent with other agents by:
6435    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6436    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6437    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6438
6439    * Any local memory cache lines will be automatically invalidated by writes
6440      from CUs associated with other L2 caches, or writes from the CPU, due to
6441      the cache probe caused by coherent requests. Coherent requests are caused
6442      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6443      XGMI, and by PCIe requests that are configured to be coherent requests.
6444    * XGMI accesses from the CPU to local memory may be cached on the CPU.
6445      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6447    * Since all work-groups on the same agent share the same L2, no L2
6448      invalidation or writeback is required for coherence.
6449    * To ensure coherence of local and remote memory writes of work-groups in
6450      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6451      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
      (used for remote coarse grain memory). Note that MTYPE CC (used for local
6453      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6454      remote fine grain memory) bypasses the L2, so both will never result in
6455      dirty L2 cache lines.
6456    * To ensure coherence of local and remote memory reads of work-groups in
6457      different agents a ``buffer_invl2`` is required. It will invalidate L2
6458      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
      coarse grain memory) cause local reads to be invalidated by remote writes
      with the PTE C-bit so these cache lines are not invalidated. Note that
6462      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6463      never result in L2 cache lines that need to be invalidated.
6464
6465  * PCIe access from the GPU to the CPU memory is kept coherent by using the
6466    MTYPE UC (uncached) which bypasses the L2.
6467
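The coherence rules in the list above can be condensed into a lookup keyed by
how two wavefronts are related. The Python sketch below is a reading aid, not
LLVM code: the function name, the relation strings, and the two MTYPE sets are
hypothetical encodings of the prose. For wavefronts on different agents it
lumps the write side (``buffer_wbl2``) and the read side (``buffer_invl2``)
together.

```python
# Illustrative summary of the GFX90A coherence rules listed above.
# Names are hypothetical; this is not LLVM code.

# MTYPEs whose L2 lines can be dirty and so are written back by buffer_wbl2.
L2_WRITEBACK_MTYPES = {"RW", "NC"}
# MTYPEs whose L2 lines are invalidated by buffer_invl2.
L2_INVALIDATE_MTYPES = {"NC"}


def coherence_instructions(relation, tgsplit=False):
    """Cache instructions the text names for coherence between two wavefronts.

    relation is one of "same-wavefront", "same-work-group",
    "same-agent" (different work-groups) or "different-agents".
    """
    if relation == "same-wavefront":
        return set()
    if relation == "same-work-group":
        # Same CU unless tgsplit mode spreads the work-group across CUs.
        return {"buffer_wbinvl1_vol"} if tgsplit else set()
    if relation == "same-agent":
        # Different CUs share the L2, so only the L1 needs attention.
        return {"buffer_wbinvl1_vol"}
    if relation == "different-agents":
        # Writes need an L2 writeback and reads need an L2 invalidate.
        return {"buffer_wbinvl1_vol", "buffer_wbl2", "buffer_invl2"}
    raise ValueError(relation)
```

Calling ``coherence_instructions("same-work-group", tgsplit=True)`` yields the
same set as ``coherence_instructions("same-agent")``, which mirrors the text:
in tgsplit mode a work-group is treated like work-groups on different CUs.
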
6468Scalar memory operations are only used to access memory that is proven to not
6469change during the execution of the kernel dispatch. This includes constant
6470address space and global address space for program scope ``const`` variables.
6471Therefore, the kernel machine code does not have to maintain the scalar cache to
6472ensure it is coherent with the vector caches. The scalar and vector caches are
6473invalidated between kernel dispatches by CP since constant address space data
6474may change between kernel dispatch executions. See
6475:ref:`amdgpu-amdhsa-memory-spaces`.
6476
6477The one exception is if scalar writes are used to spill SGPR registers. In this
6478case the AMDGPU backend ensures the memory location used to spill is never
6479accessed by vector memory operations at the same time. If scalar writes are used
6480then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6481return since the locations may be used for vector memory instructions by a
6482future wavefront that uses the same scratch area, or a function call that
6483creates a frame at the same address, respectively. There is no need for a
6484``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6485
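The placement rule for ``s_dcache_wb`` described above can be sketched as a
simple pass over a list of mnemonics. This is a minimal Python illustration, not
the backend's implementation: instructions are plain strings, the function name
is hypothetical, and representing a function return by the ``s_setpc_b64``
mnemonic is an assumption made for the example.

```python
# Illustrative sketch of where s_dcache_wb is placed when scalar writes are
# used to spill SGPRs. Not LLVM code; instructions are plain mnemonic strings.

# Assumption: a function return is represented by the mnemonic s_setpc_b64.
EXIT_MNEMONICS = ("s_endpgm", "s_setpc_b64")


def insert_scalar_writeback(instructions, uses_scalar_writes):
    """Return a copy with s_dcache_wb before each program end or return."""
    if not uses_scalar_writes:
        # No scalar writes were emitted, so nothing needs writing back.
        return list(instructions)
    result = []
    for inst in instructions:
        mnemonic = inst.split()[0] if inst.strip() else ""
        if mnemonic in EXIT_MNEMONICS:
            result.append("s_dcache_wb")
        result.append(inst)
    return result
```

With ``uses_scalar_writes=False`` the list is returned unchanged. No
``s_dcache_inv`` is added in either case, since all scalar writes are
write-before-read in the same thread.
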
6486For kernarg backing memory:
6487
6488* CP invalidates the L1 cache at the start of each kernel dispatch.
6489* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6490  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6491  cache. This also causes it to be treated as non-volatile and so is not
6492  invalidated by ``*_vol``.
6493* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6494  so the L2 cache will be coherent with the CPU and other agents.
6495
6496Scratch backing memory (which is used for the private address space) is accessed
6497with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6498only accessed by a single thread, and is always write-before-read, there is
6499never a need to invalidate these entries from the L1 cache. Hence all cache
6500invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6501
6502The code sequences used to implement the memory model for GFX90A are defined
6503in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6504
6505  .. table:: AMDHSA Memory Model Code Sequences GFX90A
6506     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6507
6508     ============ ============ ============== ========== ================================
6509     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6510                  Ordering     Sync Scope     Address    GFX90A
6511                                              Space
6512     ============ ============ ============== ========== ================================
6513     **Non-Atomic**
6514     ------------------------------------------------------------------------------------
6515     load         *none*       *none*         - global   - !volatile & !nontemporal
6516                                              - generic
6517                                              - private    1. buffer/global/flat_load
6518                                              - constant
6519                                                         - !volatile & nontemporal
6520
6521                                                           1. buffer/global/flat_load
6522                                                              glc=1 slc=1
6523
6524                                                         - volatile
6525
6526                                                           1. buffer/global/flat_load
6527                                                              glc=1
6528                                                           2. s_waitcnt vmcnt(0)
6529
6530                                                            - Must happen before
6531                                                              any following volatile
6532                                                              global/generic
6533                                                              load/store.
6534                                                            - Ensures that
6535                                                              volatile
6536                                                              operations to
6537                                                              different
6538                                                              addresses will not
6539                                                              be reordered by
6540                                                              hardware.
6541
6542     load         *none*       *none*         - local    1. ds_load
6543     store        *none*       *none*         - global   - !volatile & !nontemporal
6544                                              - generic
6545                                              - private    1. buffer/global/flat_store
6546                                              - constant
6547                                                         - !volatile & nontemporal
6548
6549                                                           1. buffer/global/flat_store
6550                                                              glc=1 slc=1
6551
6552                                                         - volatile
6553
6554                                                           1. buffer/global/flat_store
6555                                                           2. s_waitcnt vmcnt(0)
6556
6557                                                            - Must happen before
6558                                                              any following volatile
6559                                                              global/generic
6560                                                              load/store.
6561                                                            - Ensures that
6562                                                              volatile
6563                                                              operations to
6564                                                              different
6565                                                              addresses will not
6566                                                              be reordered by
6567                                                              hardware.
6568
6569     store        *none*       *none*         - local    1. ds_store
6570     **Unordered Atomic**
6571     ------------------------------------------------------------------------------------
6572     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6573     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6574     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6575     **Monotonic Atomic**
6576     ------------------------------------------------------------------------------------
6577     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6578                               - wavefront    - generic
6579     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6580                                              - generic     glc=1
6581
6582                                                           - If not TgSplit execution
6583                                                             mode, omit glc=1.
6584
6585     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6586                               - wavefront               local address space cannot
6587                               - workgroup               be used.*
6588
6589                                                         1. ds_load
6590     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6591                                              - generic     glc=1
6592     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6593                                              - generic     glc=1
6594     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6595                               - wavefront    - generic
6596                               - workgroup
6597                               - agent
6598     store atomic monotonic    - system       - global   1. buffer/global/flat_store
6599                                              - generic
6600     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6601                               - wavefront               local address space cannot
6602                               - workgroup               be used.*
6603
6604                                                         1. ds_store
6605     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6606                               - wavefront    - generic
6607                               - workgroup
6608                               - agent
6609     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6610                                              - generic
6611     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6612                               - wavefront               local address space cannot
6613                               - workgroup               be used.*
6614
6615                                                         1. ds_atomic
6616     **Acquire Atomic**
6617     ------------------------------------------------------------------------------------
6618     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6619                               - wavefront    - local
6620                                              - generic
6621     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6622
6623                                                           - If not TgSplit execution
6624                                                             mode, omit glc=1.
6625
6626                                                         2. s_waitcnt vmcnt(0)
6627
6628                                                           - If not TgSplit execution
6629                                                             mode, omit.
6630                                                           - Must happen before the
6631                                                             following buffer_wbinvl1_vol.
6632
6633                                                         3. buffer_wbinvl1_vol
6634
6635                                                           - If not TgSplit execution
6636                                                             mode, omit.
6637                                                           - Must happen before
6638                                                             any following
6639                                                             global/generic
6640                                                             load/load
6641                                                             atomic/store/store
6642                                                             atomic/atomicrmw.
6643                                                           - Ensures that
6644                                                             following
6645                                                             loads will not see
6646                                                             stale data.
6647
6648     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6649                                                         local address space cannot
6650                                                         be used.*
6651
6652                                                         1. ds_load
6653                                                         2. s_waitcnt lgkmcnt(0)
6654
6655                                                           - If OpenCL, omit.
6656                                                           - Must happen before
6657                                                             any following
6658                                                             global/generic
6659                                                             load/load
6660                                                             atomic/store/store
6661                                                             atomic/atomicrmw.
6662                                                           - Ensures any
6663                                                             following global
6664                                                             data read is no
6665                                                             older than the local load
6666                                                             atomic value being
6667                                                             acquired.
6668
6669     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6670
6671                                                           - If not TgSplit execution
6672                                                             mode, omit glc=1.
6673
6674                                                         2. s_waitcnt lgkm/vmcnt(0)
6675
6676                                                           - Use lgkmcnt(0) if not
6677                                                             TgSplit execution mode
6678                                                             and vmcnt(0) if TgSplit
6679                                                             execution mode.
6680                                                           - If OpenCL, omit lgkmcnt(0).
6681                                                           - Must happen before
6682                                                             the following
6683                                                             buffer_wbinvl1_vol and any
6684                                                             following global/generic
6685                                                             load/load
6686                                                             atomic/store/store
6687                                                             atomic/atomicrmw.
6688                                                           - Ensures any
6689                                                             following global
6690                                                             data read is no
6691                                                             older than a local load
6692                                                             atomic value being
6693                                                             acquired.
6694
6695                                                         3. buffer_wbinvl1_vol
6696
6697                                                           - If not TgSplit execution
6698                                                             mode, omit.
6699                                                           - Ensures that
6700                                                             following
6701                                                             loads will not see
6702                                                             stale data.
6703
6704     load atomic  acquire      - agent        - global   1. buffer/global_load
6705                                                            glc=1
6706                                                         2. s_waitcnt vmcnt(0)
6707
6708                                                           - Must happen before
6709                                                             following
6710                                                             buffer_wbinvl1_vol.
6711                                                           - Ensures the load
6712                                                             has completed
6713                                                             before invalidating
6714                                                             the cache.
6715
6716                                                         3. buffer_wbinvl1_vol
6717
6718                                                           - Must happen before
6719                                                             any following
6720                                                             global/generic
6721                                                             load/load
6722                                                             atomic/atomicrmw.
6723                                                           - Ensures that
6724                                                             following
6725                                                             loads will not see
6726                                                             stale global data.
6727
6728     load atomic  acquire      - system       - global   1. buffer/global/flat_load
6729                                                            glc=1
6730                                                         2. s_waitcnt vmcnt(0)
6731
6732                                                           - Must happen before
6733                                                             following buffer_invl2 and
6734                                                             buffer_wbinvl1_vol.
6735                                                           - Ensures the load
6736                                                             has completed
6737                                                             before invalidating
6738                                                             the caches.
6739
6740                                                         3. buffer_invl2;
6741                                                            buffer_wbinvl1_vol
6742
6743                                                           - Must happen before
6744                                                             any following
6745                                                             global/generic
6746                                                             load/load
6747                                                             atomic/atomicrmw.
6748                                                           - Ensures that
6749                                                             following
6750                                                             loads will not see
6751                                                             stale L1 global data,
6752                                                             nor see stale L2 MTYPE
6753                                                             NC global data.
6754                                                             MTYPE RW and CC memory will
6755                                                             never be stale in L2 due to
6756                                                             the memory probes.
6757
6758     load atomic  acquire      - agent        - generic  1. flat_load glc=1
6759                                                         2. s_waitcnt vmcnt(0) &
6760                                                            lgkmcnt(0)
6761
6762                                                           - If TgSplit execution mode,
6763                                                             omit lgkmcnt(0).
6764                                                           - If OpenCL, omit
6765                                                             lgkmcnt(0).
6766                                                           - Must happen before
6767                                                             following
6768                                                             buffer_wbinvl1_vol.
6769                                                           - Ensures the flat_load
6770                                                             has completed
6771                                                             before invalidating
6772                                                             the cache.
6773
6774                                                         3. buffer_wbinvl1_vol
6775
6776                                                           - Must happen before
6777                                                             any following
6778                                                             global/generic
6779                                                             load/load
6780                                                             atomic/atomicrmw.
6781                                                           - Ensures that
6782                                                             following loads
6783                                                             will not see stale
6784                                                             global data.
6785
6786     load atomic  acquire      - system       - generic  1. flat_load glc=1
6787                                                         2. s_waitcnt vmcnt(0) &
6788                                                            lgkmcnt(0)
6789
6790                                                           - If TgSplit execution mode,
6791                                                             omit lgkmcnt(0).
6792                                                           - If OpenCL, omit
6793                                                             lgkmcnt(0).
6794                                                           - Must happen before
6795                                                             following
6796                                                             buffer_invl2 and
6797                                                             buffer_wbinvl1_vol.
6798                                                           - Ensures the flat_load
6799                                                             has completed
6800                                                             before invalidating
6801                                                             the caches.
6802
6803                                                         3. buffer_invl2;
6804                                                            buffer_wbinvl1_vol
6805
6806                                                           - Must happen before
6807                                                             any following
6808                                                             global/generic
6809                                                             load/load
6810                                                             atomic/atomicrmw.
6811                                                           - Ensures that
6812                                                             following
6813                                                             loads will not see
6814                                                             stale L1 global data,
6815                                                             nor see stale L2 MTYPE
6816                                                             NC global data.
6817                                                             MTYPE RW and CC memory will
6818                                                             never be stale in L2 due to
6819                                                             the memory probes.
6820
6821     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6822                               - wavefront    - generic
6823     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6824                               - wavefront               local address space cannot
6825                                                         be used.*
6826
6827                                                         1. ds_atomic
6828     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6829                                                         2. s_waitcnt vmcnt(0)
6830
6831                                                           - If not TgSplit execution
6832                                                             mode, omit.
6833                                                           - Must happen before the
6834                                                             following buffer_wbinvl1_vol.
6835                                                           - Ensures the atomicrmw
6836                                                             has completed
6837                                                             before invalidating
6838                                                             the cache.
6839
6840                                                         3. buffer_wbinvl1_vol
6841
6842                                                           - If not TgSplit execution
6843                                                             mode, omit.
6844                                                           - Must happen before
6845                                                             any following
6846                                                             global/generic
6847                                                             load/load
6848                                                             atomic/atomicrmw.
6849                                                           - Ensures that
6850                                                             following loads
6851                                                             will not see stale
6852                                                             global data.
6853
6854     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6855                                                         local address space cannot
6856                                                         be used.*
6857
6858                                                         1. ds_atomic
6859                                                         2. s_waitcnt lgkmcnt(0)
6860
6861                                                           - If OpenCL, omit.
6862                                                           - Must happen before
6863                                                             any following
6864                                                             global/generic
6865                                                             load/load
6866                                                             atomic/store/store
6867                                                             atomic/atomicrmw.
6868                                                           - Ensures any
6869                                                             following global
6870                                                             data read is no
6871                                                             older than the local
6872                                                             atomicrmw value
6873                                                             being acquired.
6874
6875     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6876                                                         2. s_waitcnt lgkm/vmcnt(0)
6877
6878                                                           - Use lgkmcnt(0) if not
6879                                                             TgSplit execution mode
6880                                                             and vmcnt(0) if TgSplit
6881                                                             execution mode.
6882                                                           - If OpenCL, omit lgkmcnt(0).
6883                                                           - Must happen before
6884                                                             the following
6885                                                             buffer_wbinvl1_vol and
6886                                                             any following
6887                                                             global/generic
6888                                                             load/load
6889                                                             atomic/store/store
6890                                                             atomic/atomicrmw.
6891                                                           - Ensures any
6892                                                             following global
6893                                                             data read is no
6894                                                             older than a local
6895                                                             atomicrmw value
6896                                                             being acquired.
6897
6898                                                         3. buffer_wbinvl1_vol
6899
6900                                                           - If not TgSplit execution
6901                                                             mode, omit.
6902                                                           - Ensures that
6903                                                             following
6904                                                             loads will not see
6905                                                             stale data.
6906
6907     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6908                                                         2. s_waitcnt vmcnt(0)
6909
6910                                                           - Must happen before
6911                                                             following
6912                                                             buffer_wbinvl1_vol.
6913                                                           - Ensures the
6914                                                             atomicrmw has
6915                                                             completed before
6916                                                             invalidating the
6917                                                             cache.
6918
6919                                                         3. buffer_wbinvl1_vol
6920
6921                                                           - Must happen before
6922                                                             any following
6923                                                             global/generic
6924                                                             load/load
6925                                                             atomic/atomicrmw.
6926                                                           - Ensures that
6927                                                             following loads
6928                                                             will not see stale
6929                                                             global data.
6930
6931     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
6932                                                         2. s_waitcnt vmcnt(0)
6933
6934                                                           - Must happen before
6935                                                             following buffer_invl2 and
6936                                                             buffer_wbinvl1_vol.
6937                                                           - Ensures the
6938                                                             atomicrmw has
6939                                                             completed before
6940                                                             invalidating the
6941                                                             caches.
6942
6943                                                         3. buffer_invl2;
6944                                                            buffer_wbinvl1_vol
6945
6946                                                           - Must happen before
6947                                                             any following
6948                                                             global/generic
6949                                                             load/load
6950                                                             atomic/atomicrmw.
6951                                                           - Ensures that
6952                                                             following
6953                                                             loads will not see
6954                                                             stale L1 global data,
6955                                                             nor see stale L2 MTYPE
6956                                                             NC global data.
6957                                                             MTYPE RW and CC memory will
6958                                                             never be stale in L2 due to
6959                                                             the memory probes.
6960
6961     atomicrmw    acquire      - agent        - generic  1. flat_atomic
6962                                                         2. s_waitcnt vmcnt(0) &
6963                                                            lgkmcnt(0)
6964
6965                                                           - If TgSplit execution mode,
6966                                                             omit lgkmcnt(0).
6967                                                           - If OpenCL, omit
6968                                                             lgkmcnt(0).
6969                                                           - Must happen before
6970                                                             following
6971                                                             buffer_wbinvl1_vol.
6972                                                           - Ensures the
6973                                                             atomicrmw has
6974                                                             completed before
6975                                                             invalidating the
6976                                                             cache.
6977
6978                                                         3. buffer_wbinvl1_vol
6979
6980                                                           - Must happen before
6981                                                             any following
6982                                                             global/generic
6983                                                             load/load
6984                                                             atomic/atomicrmw.
6985                                                           - Ensures that
6986                                                             following loads
6987                                                             will not see stale
6988                                                             global data.
6989
6990     atomicrmw    acquire      - system       - generic  1. flat_atomic
6991                                                         2. s_waitcnt vmcnt(0) &
6992                                                            lgkmcnt(0)
6993
6994                                                           - If TgSplit execution mode,
6995                                                             omit lgkmcnt(0).
6996                                                           - If OpenCL, omit
6997                                                             lgkmcnt(0).
6998                                                           - Must happen before
6999                                                             following
7000                                                             buffer_invl2 and
7001                                                             buffer_wbinvl1_vol.
7002                                                           - Ensures the
7003                                                             atomicrmw has
7004                                                             completed before
7005                                                             invalidating the
7006                                                             caches.
7007
7008                                                         3. buffer_invl2;
7009                                                            buffer_wbinvl1_vol
7010
7011                                                           - Must happen before
7012                                                             any following
7013                                                             global/generic
7014                                                             load/load
7015                                                             atomic/atomicrmw.
7016                                                           - Ensures that
7017                                                             following
7018                                                             loads will not see
7019                                                             stale L1 global data,
7020                                                             nor see stale L2 MTYPE
7021                                                             NC global data.
7022                                                             MTYPE RW and CC memory will
7023                                                             never be stale in L2 due to
7024                                                             the memory probes.
7025
7026     fence        acquire      - singlethread *none*     *none*
7027                               - wavefront
7028     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7029
7030                                                           - Use lgkmcnt(0) if not
7031                                                             TgSplit execution mode
7032                                                             and vmcnt(0) if TgSplit
7033                                                             execution mode.
7034                                                           - If OpenCL and
7035                                                             address space is
7036                                                             not generic, omit
7037                                                             lgkmcnt(0).
7038                                                           - If OpenCL and
7039                                                             address space is
7040                                                             local, omit
7041                                                             vmcnt(0).
7042                                                           - However, since LLVM
7043                                                             currently has no
7044                                                             address space on
7045                                                             the fence, these waits
7046                                                             must conservatively
7047                                                             always be generated.
7048                                                             If the fence had an
7049                                                             address space, it
7050                                                             would be set to the
7051                                                             address space of the
7052                                                             OpenCL fence flag, or
7053                                                             to generic if both
7054                                                             local and global
7055                                                             flags are
7056                                                             specified.
7057                                                           - s_waitcnt vmcnt(0)
7058                                                             must happen after
7059                                                             any preceding
7060                                                             global/generic load
7061                                                             atomic/
7062                                                             atomicrmw
7063                                                             with an equal or
7064                                                             wider sync scope
7065                                                             and memory ordering
7066                                                             stronger than
7067                                                             unordered (this is
7068                                                             termed the
7069                                                             fence-paired-atomic).
7070                                                           - s_waitcnt lgkmcnt(0)
7071                                                             must happen after
7072                                                             any preceding
7073                                                             local/generic load
7074                                                             atomic/atomicrmw
7075                                                             with an equal or
7076                                                             wider sync scope
7077                                                             and memory ordering
7078                                                             stronger than
7079                                                             unordered (this is
7080                                                             termed the
7081                                                             fence-paired-atomic).
7082                                                           - Must happen before
7083                                                             the following
7084                                                             buffer_wbinvl1_vol and
7085                                                             any following
7086                                                             global/generic
7087                                                             load/load
7088                                                             atomic/store/store
7089                                                             atomic/atomicrmw.
7090                                                           - Ensures any
7091                                                             following global
7092                                                             data read is no
7093                                                             older than the
7094                                                             value read by the
7095                                                             fence-paired-atomic.
7096
7097                                                         2. buffer_wbinvl1_vol
7098
7099                                                           - If not TgSplit execution
7100                                                             mode, omit.
7101                                                           - Ensures that
7102                                                             following
7103                                                             loads will not see
7104                                                             stale data.
7105
7106     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7107                                                            vmcnt(0)
7108
7109                                                           - If TgSplit execution mode,
7110                                                             omit lgkmcnt(0).
7111                                                           - If OpenCL and
7112                                                             address space is
7113                                                             not generic, omit
7114                                                             lgkmcnt(0).
7115                                                           - However, since LLVM
7116                                                             currently has no
7117                                                             address space on
7118                                                             the fence, these waits
7119                                                             must conservatively
7120                                                             always be generated
7121                                                             (see comment for
7122                                                             previous fence).
7123                                                           - Could be split into
7124                                                             separate s_waitcnt
7125                                                             vmcnt(0) and
7126                                                             s_waitcnt
7127                                                             lgkmcnt(0) to allow
7128                                                             them to be
7129                                                             independently moved
7130                                                             according to the
7131                                                             following rules.
7132                                                           - s_waitcnt vmcnt(0)
7133                                                             must happen after
7134                                                             any preceding
7135                                                             global/generic load
7136                                                             atomic/atomicrmw
7137                                                             with an equal or
7138                                                             wider sync scope
7139                                                             and memory ordering
7140                                                             stronger than
7141                                                             unordered (this is
7142                                                             termed the
7143                                                             fence-paired-atomic).
7144                                                           - s_waitcnt lgkmcnt(0)
7145                                                             must happen after
7146                                                             any preceding
7147                                                             local/generic load
7148                                                             atomic/atomicrmw
7149                                                             with an equal or
7150                                                             wider sync scope
7151                                                             and memory ordering
7152                                                             stronger than
7153                                                             unordered (this is
7154                                                             termed the
7155                                                             fence-paired-atomic).
7156                                                           - Must happen before
7157                                                             the following
7158                                                             buffer_wbinvl1_vol.
7159                                                           - Ensures that the
7160                                                             fence-paired-atomic
7161                                                             has completed
7162                                                             before invalidating
7163                                                             the
7164                                                             cache. Therefore
7165                                                             any following
7166                                                             locations read must
7167                                                             be no older than
7168                                                             the value read by
7169                                                             the
7170                                                             fence-paired-atomic.
7171
7172                                                         2. buffer_wbinvl1_vol
7173
7174                                                           - Must happen before any
7175                                                             following global/generic
7176                                                             load/load
7177                                                             atomic/store/store
7178                                                             atomic/atomicrmw.
7179                                                           - Ensures that
7180                                                             following loads
7181                                                             will not see stale
7182                                                             global data.
7183
7184     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
7185                                                            vmcnt(0)
7186
7187                                                           - If TgSplit execution mode,
7188                                                             omit lgkmcnt(0).
7189                                                           - If OpenCL and
7190                                                             address space is
7191                                                             not generic, omit
7192                                                             lgkmcnt(0).
7193                                                           - However, since LLVM
7194                                                             currently has no
7195                                                             address space on
7196                                                             the fence, these waits
7197                                                             must conservatively
7198                                                             always be generated
7199                                                             (see comment for
7200                                                             previous fence).
7201                                                           - Could be split into
7202                                                             separate s_waitcnt
7203                                                             vmcnt(0) and
7204                                                             s_waitcnt
7205                                                             lgkmcnt(0) to allow
7206                                                             them to be
7207                                                             independently moved
7208                                                             according to the
7209                                                             following rules.
7210                                                           - s_waitcnt vmcnt(0)
7211                                                             must happen after
7212                                                             any preceding
7213                                                             global/generic load
7214                                                             atomic/atomicrmw
7215                                                             with an equal or
7216                                                             wider sync scope
7217                                                             and memory ordering
7218                                                             stronger than
7219                                                             unordered (this is
7220                                                             termed the
7221                                                             fence-paired-atomic).
7222                                                           - s_waitcnt lgkmcnt(0)
7223                                                             must happen after
7224                                                             any preceding
7225                                                             local/generic load
7226                                                             atomic/atomicrmw
7227                                                             with an equal or
7228                                                             wider sync scope
7229                                                             and memory ordering
7230                                                             stronger than
7231                                                             unordered (this is
7232                                                             termed the
7233                                                             fence-paired-atomic).
7234                                                           - Must happen before
7235                                                             the following buffer_invl2 and
7236                                                             buffer_wbinvl1_vol.
7237                                                           - Ensures that the
7238                                                             fence-paired-atomic
7239                                                             has completed
7240                                                             before invalidating
7241                                                             the
7242                                                             cache. Therefore
7243                                                             any following
7244                                                             locations read must
7245                                                             be no older than
7246                                                             the value read by
7247                                                             the
7248                                                             fence-paired-atomic.
7249
7250                                                         2. buffer_invl2;
7251                                                            buffer_wbinvl1_vol
7252
7253                                                           - Must happen before any
7254                                                             following global/generic
7255                                                             load/load
7256                                                             atomic/store/store
7257                                                             atomic/atomicrmw.
7258                                                           - Ensures that
7259                                                             following
7260                                                             loads will not see
7261                                                             stale L1 global data,
7262                                                             nor see stale L2 MTYPE
7263                                                             NC global data.
7264                                                             MTYPE RW and CC memory will
7265                                                             never be stale in L2 due to
7266                                                             the memory probes.
7267     **Release Atomic**
7268     ------------------------------------------------------------------------------------
7269     store atomic release      - singlethread - global   1. buffer/global/flat_store
7270                               - wavefront    - generic
7271     store atomic release      - singlethread - local    *If TgSplit execution mode,
7272                               - wavefront               local address space cannot
7273                                                         be used.*
7274
7275                                                         1. ds_store
7276     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7277                                              - generic
7278                                                           - Use lgkmcnt(0) if not
7279                                                             TgSplit execution mode
7280                                                             and vmcnt(0) if TgSplit
7281                                                             execution mode.
7282                                                           - If OpenCL, omit lgkmcnt(0).
7283                                                           - s_waitcnt vmcnt(0)
7284                                                             must happen after
7285                                                             any preceding
7286                                                             global/generic load/store/
7287                                                             load atomic/store atomic/
7288                                                             atomicrmw.
7289                                                           - s_waitcnt lgkmcnt(0)
7290                                                             must happen after
7291                                                             any preceding
7292                                                             local/generic
7293                                                             load/store/load
7294                                                             atomic/store
7295                                                             atomic/atomicrmw.
7296                                                           - Must happen before
7297                                                             the following
7298                                                             store.
7299                                                           - Ensures that all
7300                                                             memory operations
7301                                                             have
7302                                                             completed before
7303                                                             performing the
7304                                                             store that is being
7305                                                             released.
7306
7307                                                         2. buffer/global/flat_store
7308     store atomic release      - workgroup    - local    *If TgSplit execution mode,
7309                                                         local address space cannot
7310                                                         be used.*
7311
7312                                                         1. ds_store
7313     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7314                                              - generic     vmcnt(0)
7315
7316                                                           - If TgSplit execution mode,
7317                                                             omit lgkmcnt(0).
7318                                                           - If OpenCL and
7319                                                             address space is
7320                                                             not generic, omit
7321                                                             lgkmcnt(0).
7322                                                           - Could be split into
7323                                                             separate s_waitcnt
7324                                                             vmcnt(0) and
7325                                                             s_waitcnt
7326                                                             lgkmcnt(0) to allow
7327                                                             them to be
7328                                                             independently moved
7329                                                             according to the
7330                                                             following rules.
7331                                                           - s_waitcnt vmcnt(0)
7332                                                             must happen after
7333                                                             any preceding
7334                                                             global/generic
7335                                                             load/store/load
7336                                                             atomic/store
7337                                                             atomic/atomicrmw.
7338                                                           - s_waitcnt lgkmcnt(0)
7339                                                             must happen after
7340                                                             any preceding
7341                                                             local/generic
7342                                                             load/store/load
7343                                                             atomic/store
7344                                                             atomic/atomicrmw.
7345                                                           - Must happen before
7346                                                             the following
7347                                                             store.
7348                                                           - Ensures that all
7349                                                             memory operations
7350                                                             have
7351                                                             completed before
7352                                                             performing the
7353                                                             store that is being
7354                                                             released.
7355
7356                                                         2. buffer/global/flat_store
7357     store atomic release      - system       - global   1. buffer_wbl2
7358                                              - generic
7359                                                           - Must happen before
7360                                                             following s_waitcnt.
7361                                                           - Performs L2 writeback to
7362                                                             ensure previous
7363                                                             global/generic
7364                                                             store/atomicrmw are
7365                                                             visible at system scope.
7366
7367                                                         2. s_waitcnt lgkmcnt(0) &
7368                                                            vmcnt(0)
7369
7370                                                           - If TgSplit execution mode,
7371                                                             omit lgkmcnt(0).
7372                                                           - If OpenCL and
7373                                                             address space is
7374                                                             not generic, omit
7375                                                             lgkmcnt(0).
7376                                                           - Could be split into
7377                                                             separate s_waitcnt
7378                                                             vmcnt(0) and
7379                                                             s_waitcnt
7380                                                             lgkmcnt(0) to allow
7381                                                             them to be
7382                                                             independently moved
7383                                                             according to the
7384                                                             following rules.
7385                                                           - s_waitcnt vmcnt(0)
7386                                                             must happen after any
7387                                                             preceding
7388                                                             global/generic
7389                                                             load/store/load
7390                                                             atomic/store
7391                                                             atomic/atomicrmw.
7392                                                           - s_waitcnt lgkmcnt(0)
7393                                                             must happen after any
7394                                                             preceding
7395                                                             local/generic
7396                                                             load/store/load
7397                                                             atomic/store
7398                                                             atomic/atomicrmw.
7399                                                           - Must happen before
7400                                                             the following
7401                                                             store.
7402                                                           - Ensures that all
7403                                                             memory operations
7404                                                             and the L2
7405                                                             writeback have
7406                                                             completed before
7407                                                             performing the
7408                                                             store that is being
7409                                                             released.
7410
7411                                                         3. buffer/global/flat_store
7412     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7413                               - wavefront    - generic
7414     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7415                               - wavefront               local address space cannot
7416                                                         be used.*
7417
7418                                                         1. ds_atomic
7419     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7420                                              - generic
7421                                                           - Use lgkmcnt(0) if not
7422                                                             TgSplit execution mode
7423                                                             and vmcnt(0) if TgSplit
7424                                                             execution mode.
7425                                                           - If OpenCL, omit
7426                                                             lgkmcnt(0).
7427                                                           - s_waitcnt vmcnt(0)
7428                                                             must happen after
7429                                                             any preceding
7430                                                             global/generic load/store/
7431                                                             load atomic/store atomic/
7432                                                             atomicrmw.
7433                                                           - s_waitcnt lgkmcnt(0)
7434                                                             must happen after
7435                                                             any preceding
7436                                                             local/generic
7437                                                             load/store/load
7438                                                             atomic/store
7439                                                             atomic/atomicrmw.
7440                                                           - Must happen before
7441                                                             the following
7442                                                             atomicrmw.
7443                                                           - Ensures that all
7444                                                             memory operations
7445                                                             have
7446                                                             completed before
7447                                                             performing the
7448                                                             atomicrmw that is
7449                                                             being released.
7450
7451                                                         2. buffer/global/flat_atomic
7452     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7453                                                         local address space cannot
7454                                                         be used.*
7455
7456                                                         1. ds_atomic
7457     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7458                                              - generic     vmcnt(0)
7459
7460                                                           - If TgSplit execution mode,
7461                                                             omit lgkmcnt(0).
7462                                                           - If OpenCL, omit
7463                                                             lgkmcnt(0).
7464                                                           - Could be split into
7465                                                             separate s_waitcnt
7466                                                             vmcnt(0) and
7467                                                             s_waitcnt
7468                                                             lgkmcnt(0) to allow
7469                                                             them to be
7470                                                             independently moved
7471                                                             according to the
7472                                                             following rules.
7473                                                           - s_waitcnt vmcnt(0)
7474                                                             must happen after
7475                                                             any preceding
7476                                                             global/generic
7477                                                             load/store/load
7478                                                             atomic/store
7479                                                             atomic/atomicrmw.
7480                                                           - s_waitcnt lgkmcnt(0)
7481                                                             must happen after
7482                                                             any preceding
7483                                                             local/generic
7484                                                             load/store/load
7485                                                             atomic/store
7486                                                             atomic/atomicrmw.
7487                                                           - Must happen before
7488                                                             the following
7489                                                             atomicrmw.
7490                                                           - Ensures that all
7491                                                             memory operations
7492                                                             to global and local
7493                                                             have completed
7494                                                             before performing
7495                                                             the atomicrmw that
7496                                                             is being released.
7497
7498                                                         2. buffer/global/flat_atomic
7499     atomicrmw    release      - system       - global   1. buffer_wbl2
7500                                              - generic
7501                                                           - Must happen before
7502                                                             following s_waitcnt.
7503                                                           - Performs L2 writeback to
7504                                                             ensure previous
7505                                                             global/generic
7506                                                             store/atomicrmw are
7507                                                             visible at system scope.
7508
7509                                                         2. s_waitcnt lgkmcnt(0) &
7510                                                            vmcnt(0)
7511
7512                                                           - If TgSplit execution mode,
7513                                                             omit lgkmcnt(0).
7514                                                           - If OpenCL, omit
7515                                                             lgkmcnt(0).
7516                                                           - Could be split into
7517                                                             separate s_waitcnt
7518                                                             vmcnt(0) and
7519                                                             s_waitcnt
7520                                                             lgkmcnt(0) to allow
7521                                                             them to be
7522                                                             independently moved
7523                                                             according to the
7524                                                             following rules.
7525                                                           - s_waitcnt vmcnt(0)
7526                                                             must happen after
7527                                                             any preceding
7528                                                             global/generic
7529                                                             load/store/load
7530                                                             atomic/store
7531                                                             atomic/atomicrmw.
7532                                                           - s_waitcnt lgkmcnt(0)
7533                                                             must happen after
7534                                                             any preceding
7535                                                             local/generic
7536                                                             load/store/load
7537                                                             atomic/store
7538                                                             atomic/atomicrmw.
7539                                                           - Must happen before
7540                                                             the following
7541                                                             atomicrmw.
7542                                                           - Ensures that all
7543                                                             memory operations
7544                                                             to memory and the L2
7545                                                             writeback have
7546                                                             completed before
7547                                                             performing the
7548                                                             atomicrmw that is
7549                                                             being released.
7550
7551                                                         3. buffer/global/flat_atomic
7552     fence        release      - singlethread *none*     *none*
7553                               - wavefront
7554     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7555
7556                                                           - Use lgkmcnt(0) if not
7557                                                             TgSplit execution mode
7558                                                             and vmcnt(0) if TgSplit
7559                                                             execution mode.
7560                                                           - If OpenCL and
7561                                                             address space is
7562                                                             not generic, omit
7563                                                             lgkmcnt(0).
7564                                                           - If OpenCL and
7565                                                             address space is
7566                                                             local, omit
7567                                                             vmcnt(0).
7568                                                           - However, since LLVM
7569                                                             currently has no
7570                                                             address space on
7571                                                             the fence, the waits
7572                                                             must conservatively
7573                                                             always be generated.
7574                                                             If the fence had an
7575                                                             address space, it
7576                                                             would be set to the
7577                                                             address space of the
7578                                                             OpenCL fence flag, or
7579                                                             to generic if both
7580                                                             local and global
7581                                                             flags are
7582                                                             specified.
7583                                                           - s_waitcnt vmcnt(0)
7584                                                             must happen after
7585                                                             any preceding
7586                                                             global/generic
7587                                                             load/store/
7588                                                             load atomic/store atomic/
7589                                                             atomicrmw.
7590                                                           - s_waitcnt lgkmcnt(0)
7591                                                             must happen after
7592                                                             any preceding
7593                                                             local/generic
7594                                                             load/load
7595                                                             atomic/store/store
7596                                                             atomic/atomicrmw.
7597                                                           - Must happen before
7598                                                             any following store
7599                                                             atomic/atomicrmw
7600                                                             with an equal or
7601                                                             wider sync scope
7602                                                             and memory ordering
7603                                                             stronger than
7604                                                             unordered (this is
7605                                                             termed the
7606                                                             fence-paired-atomic).
7607                                                           - Ensures that all
7608                                                             memory operations
7609                                                             have
7610                                                             completed before
7611                                                             performing the
7612                                                             following
7613                                                             fence-paired-atomic.
7614
7615     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7616                                                            vmcnt(0)
7617
7618                                                           - If TgSplit execution mode,
7619                                                             omit lgkmcnt(0).
7620                                                           - If OpenCL and
7621                                                             address space is
7622                                                             not generic, omit
7623                                                             lgkmcnt(0).
7624                                                           - If OpenCL and
7625                                                             address space is
7626                                                             local, omit
7627                                                             vmcnt(0).
7628                                                           - However, since LLVM
7629                                                             currently has no
7630                                                             address space on
7631                                                             the fence, the waits
7632                                                             must conservatively
7633                                                             always be generated.
7634                                                             If the fence had an
7635                                                             address space, it
7636                                                             would be set to the
7637                                                             address space of the
7638                                                             OpenCL fence flag, or
7639                                                             to generic if both
7640                                                             local and global
7641                                                             flags are
7642                                                             specified.
7643                                                           - Could be split into
7644                                                             separate s_waitcnt
7645                                                             vmcnt(0) and
7646                                                             s_waitcnt
7647                                                             lgkmcnt(0) to allow
7648                                                             them to be
7649                                                             independently moved
7650                                                             according to the
7651                                                             following rules.
7652                                                           - s_waitcnt vmcnt(0)
7653                                                             must happen after
7654                                                             any preceding
7655                                                             global/generic
7656                                                             load/store/load
7657                                                             atomic/store
7658                                                             atomic/atomicrmw.
7659                                                           - s_waitcnt lgkmcnt(0)
7660                                                             must happen after
7661                                                             any preceding
7662                                                             local/generic
7663                                                             load/store/load
7664                                                             atomic/store
7665                                                             atomic/atomicrmw.
7666                                                           - Must happen before
7667                                                             any following store
7668                                                             atomic/atomicrmw
7669                                                             with an equal or
7670                                                             wider sync scope
7671                                                             and memory ordering
7672                                                             stronger than
7673                                                             unordered (this is
7674                                                             termed the
7675                                                             fence-paired-atomic).
7676                                                           - Ensures that all
7677                                                             memory operations
7678                                                             have
7679                                                             completed before
7680                                                             performing the
7681                                                             following
7682                                                             fence-paired-atomic.
7683
7684     fence        release      - system       *none*     1. buffer_wbl2
7685
7686                                                           - If OpenCL and
7687                                                             address space is
7688                                                             local, omit.
7689                                                           - Must happen before
7690                                                             following s_waitcnt.
7691                                                           - Performs L2 writeback to
7692                                                             ensure previous
7693                                                             global/generic
7694                                                             store/atomicrmw are
7695                                                             visible at system scope.
7696
7697                                                         2. s_waitcnt lgkmcnt(0) &
7698                                                            vmcnt(0)
7699
7700                                                           - If TgSplit execution mode,
7701                                                             omit lgkmcnt(0).
7702                                                           - If OpenCL and
7703                                                             address space is
7704                                                             not generic, omit
7705                                                             lgkmcnt(0).
7706                                                           - If OpenCL and
7707                                                             address space is
7708                                                             local, omit
7709                                                             vmcnt(0).
7710                                                           - However, since LLVM
7711                                                             currently has no
7712                                                             address space on
7713                                                             the fence, the waits
7714                                                             must conservatively
7715                                                             always be generated.
7716                                                             If the fence had an
7717                                                             address space, it
7718                                                             would be set to the
7719                                                             address space of the
7720                                                             OpenCL fence flag, or
7721                                                             to generic if both
7722                                                             local and global
7723                                                             flags are
7724                                                             specified.
7725                                                           - Could be split into
7726                                                             separate s_waitcnt
7727                                                             vmcnt(0) and
7728                                                             s_waitcnt
7729                                                             lgkmcnt(0) to allow
7730                                                             them to be
7731                                                             independently moved
7732                                                             according to the
7733                                                             following rules.
7734                                                           - s_waitcnt vmcnt(0)
7735                                                             must happen after
7736                                                             any preceding
7737                                                             global/generic
7738                                                             load/store/load
7739                                                             atomic/store
7740                                                             atomic/atomicrmw.
7741                                                           - s_waitcnt lgkmcnt(0)
7742                                                             must happen after
7743                                                             any preceding
7744                                                             local/generic
7745                                                             load/store/load
7746                                                             atomic/store
7747                                                             atomic/atomicrmw.
7748                                                           - Must happen before
7749                                                             any following store
7750                                                             atomic/atomicrmw
7751                                                             with an equal or
7752                                                             wider sync scope
7753                                                             and memory ordering
7754                                                             stronger than
7755                                                             unordered (this is
7756                                                             termed the
7757                                                             fence-paired-atomic).
7758                                                           - Ensures that all
7759                                                             memory operations
7760                                                             have
7761                                                             completed before
7762                                                             performing the
7763                                                             following
7764                                                             fence-paired-atomic.
7765
7766     **Acquire-Release Atomic**
7767     ------------------------------------------------------------------------------------
7768     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7769                               - wavefront    - generic
7770     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7771                               - wavefront               local address space cannot
7772                                                         be used.*
7773
7774                                                         1. ds_atomic
7775     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7776
7777                                                           - Use lgkmcnt(0) if not
7778                                                             TgSplit execution mode
7779                                                             and vmcnt(0) if TgSplit
7780                                                             execution mode.
7781                                                           - If OpenCL, omit
7782                                                             lgkmcnt(0).
7783                                                           - Must happen after
7784                                                             any preceding
7785                                                             local/generic
7786                                                             load/store/load
7787                                                             atomic/store
7788                                                             atomic/atomicrmw.
7789                                                           - s_waitcnt vmcnt(0)
7790                                                             must happen after
7791                                                             any preceding
7792                                                             global/generic load/store/
7793                                                             load atomic/store atomic/
7794                                                             atomicrmw.
7795                                                           - s_waitcnt lgkmcnt(0)
7796                                                             must happen after
7797                                                             any preceding
7798                                                             local/generic
7799                                                             load/store/load
7800                                                             atomic/store
7801                                                             atomic/atomicrmw.
7802                                                           - Must happen before
7803                                                             the following
7804                                                             atomicrmw.
7805                                                           - Ensures that all
7806                                                             memory operations
7807                                                             have
7808                                                             completed before
7809                                                             performing the
7810                                                             atomicrmw that is
7811                                                             being released.
7812
7813                                                         2. buffer/global_atomic
7814                                                         3. s_waitcnt vmcnt(0)
7815
7816                                                           - If not TgSplit execution
7817                                                             mode, omit.
7818                                                           - Must happen before
7819                                                             the following
7820                                                             buffer_wbinvl1_vol.
7821                                                           - Ensures any
7822                                                             following global
7823                                                             data read is no
7824                                                             older than the
7825                                                             atomicrmw value
7826                                                             being acquired.
7827
7828                                                         4. buffer_wbinvl1_vol
7829
7830                                                           - If not TgSplit execution
7831                                                             mode, omit.
7832                                                           - Ensures that
7833                                                             following
7834                                                             loads will not see
7835                                                             stale data.
7836
7837     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
7838                                                         local address space cannot
7839                                                         be used.*
7840
7841                                                         1. ds_atomic
7842                                                         2. s_waitcnt lgkmcnt(0)
7843
7844                                                           - If OpenCL, omit.
7845                                                           - Must happen before
7846                                                             any following
7847                                                             global/generic
7848                                                             load/load
7849                                                             atomic/store/store
7850                                                             atomic/atomicrmw.
7851                                                           - Ensures any
7852                                                             following global
7853                                                             data read is no
7854                                                             older than the local
7855                                                             atomicrmw value being
7856                                                             acquired.
7857
7858     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
7859
7860                                                           - Use lgkmcnt(0) if not
7861                                                             TgSplit execution mode
7862                                                             and vmcnt(0) if TgSplit
7863                                                             execution mode.
7864                                                           - If OpenCL, omit
7865                                                             lgkmcnt(0).
7866                                                           - s_waitcnt vmcnt(0)
7867                                                             must happen after
7868                                                             any preceding
7869                                                             global/generic load/store/
7870                                                             load atomic/store atomic/
7871                                                             atomicrmw.
7872                                                           - s_waitcnt lgkmcnt(0)
7873                                                             must happen after
7874                                                             any preceding
7875                                                             local/generic
7876                                                             load/store/load
7877                                                             atomic/store
7878                                                             atomic/atomicrmw.
7879                                                           - Must happen before
7880                                                             the following
7881                                                             atomicrmw.
7882                                                           - Ensures that all
7883                                                             memory operations
7884                                                             have
7885                                                             completed before
7886                                                             performing the
7887                                                             atomicrmw that is
7888                                                             being released.
7889
7890                                                         2. flat_atomic
7891                                                         3. s_waitcnt lgkmcnt(0) &
7892                                                            vmcnt(0)
7893
7894                                                           - If not TgSplit execution
7895                                                             mode, omit vmcnt(0).
7896                                                           - If OpenCL, omit
7897                                                             lgkmcnt(0).
7898                                                           - Must happen before
7899                                                             the following
7900                                                             buffer_wbinvl1_vol and
7901                                                             any following
7902                                                             global/generic
7903                                                             load/load
7904                                                             atomic/store/store
7905                                                             atomic/atomicrmw.
7906                                                           - Ensures any
7907                                                             following global
7908                                                             data read is no
7909                                                             older than a local
7910                                                             atomicrmw value being
7911                                                             acquired.
7912
7913                                                         4. buffer_wbinvl1_vol
7914
7915                                                           - If not TgSplit execution
7916                                                             mode, omit.
7917                                                           - Ensures that
7918                                                             following
7919                                                             loads will not see
7920                                                             stale data.
7921
7922     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7923                                                            vmcnt(0)
7924
7925                                                           - If TgSplit execution mode,
7926                                                             omit lgkmcnt(0).
7927                                                           - If OpenCL, omit
7928                                                             lgkmcnt(0).
7929                                                           - Could be split into
7930                                                             separate s_waitcnt
7931                                                             vmcnt(0) and
7932                                                             s_waitcnt
7933                                                             lgkmcnt(0) to allow
7934                                                             them to be
7935                                                             independently moved
7936                                                             according to the
7937                                                             following rules.
7938                                                           - s_waitcnt vmcnt(0)
7939                                                             must happen after
7940                                                             any preceding
7941                                                             global/generic
7942                                                             load/store/load
7943                                                             atomic/store
7944                                                             atomic/atomicrmw.
7945                                                           - s_waitcnt lgkmcnt(0)
7946                                                             must happen after
7947                                                             any preceding
7948                                                             local/generic
7949                                                             load/store/load
7950                                                             atomic/store
7951                                                             atomic/atomicrmw.
7952                                                           - Must happen before
7953                                                             the following
7954                                                             atomicrmw.
7955                                                           - Ensures that all
7956                                                             memory operations
7957                                                             to global have
7958                                                             completed before
7959                                                             performing the
7960                                                             atomicrmw that is
7961                                                             being released.
7962
7963                                                         2. buffer/global_atomic
7964                                                         3. s_waitcnt vmcnt(0)
7965
7966                                                           - Must happen before
7967                                                             following
7968                                                             buffer_wbinvl1_vol.
7969                                                           - Ensures the
7970                                                             atomicrmw has
7971                                                             completed before
7972                                                             invalidating the
7973                                                             cache.
7974
7975                                                         4. buffer_wbinvl1_vol
7976
7977                                                           - Must happen before
7978                                                             any following
7979                                                             global/generic
7980                                                             load/load
7981                                                             atomic/atomicrmw.
7982                                                           - Ensures that
7983                                                             following loads
7984                                                             will not see stale
7985                                                             global data.
7986
7987     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
7988
7989                                                           - Must happen before
7990                                                             following s_waitcnt.
7991                                                           - Performs L2 writeback to
7992                                                             ensure previous
7993                                                             global/generic
7994                                                             store/atomicrmw are
7995                                                             visible at system scope.
7996
7997                                                         2. s_waitcnt lgkmcnt(0) &
7998                                                            vmcnt(0)
7999
8000                                                           - If TgSplit execution mode,
8001                                                             omit lgkmcnt(0).
8002                                                           - If OpenCL, omit
8003                                                             lgkmcnt(0).
8004                                                           - Could be split into
8005                                                             separate s_waitcnt
8006                                                             vmcnt(0) and
8007                                                             s_waitcnt
8008                                                             lgkmcnt(0) to allow
8009                                                             them to be
8010                                                             independently moved
8011                                                             according to the
8012                                                             following rules.
8013                                                           - s_waitcnt vmcnt(0)
8014                                                             must happen after
8015                                                             any preceding
8016                                                             global/generic
8017                                                             load/store/load
8018                                                             atomic/store
8019                                                             atomic/atomicrmw.
8020                                                           - s_waitcnt lgkmcnt(0)
8021                                                             must happen after
8022                                                             any preceding
8023                                                             local/generic
8024                                                             load/store/load
8025                                                             atomic/store
8026                                                             atomic/atomicrmw.
8027                                                           - Must happen before
8028                                                             the following
8029                                                             atomicrmw.
8030                                                           - Ensures that all
8031                                                             memory operations
8032                                                             to global and L2 writeback
8033                                                             have completed before
8034                                                             performing the
8035                                                             atomicrmw that is
8036                                                             being released.
8037
8038                                                         3. buffer/global_atomic
8039                                                         4. s_waitcnt vmcnt(0)
8040
8041                                                           - Must happen before
8042                                                             following buffer_invl2 and
8043                                                             buffer_wbinvl1_vol.
8044                                                           - Ensures the
8045                                                             atomicrmw has
8046                                                             completed before
8047                                                             invalidating the
8048                                                             caches.
8049
8050                                                         5. buffer_invl2;
8051                                                            buffer_wbinvl1_vol
8052
8053                                                           - Must happen before
8054                                                             any following
8055                                                             global/generic
8056                                                             load/load
8057                                                             atomic/atomicrmw.
8058                                                           - Ensures that
8059                                                             following
8060                                                             loads will not see
8061                                                             stale L1 global data,
8062                                                             nor see stale L2 MTYPE
8063                                                             NC global data.
8064                                                             MTYPE RW and CC memory will
8065                                                             never be stale in L2 due to
8066                                                             the memory probes.
8067
8068     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
8069                                                            vmcnt(0)
8070
8071                                                           - If TgSplit execution mode,
8072                                                             omit lgkmcnt(0).
8073                                                           - If OpenCL, omit
8074                                                             lgkmcnt(0).
8075                                                           - Could be split into
8076                                                             separate s_waitcnt
8077                                                             vmcnt(0) and
8078                                                             s_waitcnt
8079                                                             lgkmcnt(0) to allow
8080                                                             them to be
8081                                                             independently moved
8082                                                             according to the
8083                                                             following rules.
8084                                                           - s_waitcnt vmcnt(0)
8085                                                             must happen after
8086                                                             any preceding
8087                                                             global/generic
8088                                                             load/store/load
8089                                                             atomic/store
8090                                                             atomic/atomicrmw.
8091                                                           - s_waitcnt lgkmcnt(0)
8092                                                             must happen after
8093                                                             any preceding
8094                                                             local/generic
8095                                                             load/store/load
8096                                                             atomic/store
8097                                                             atomic/atomicrmw.
8098                                                           - Must happen before
8099                                                             the following
8100                                                             atomicrmw.
8101                                                           - Ensures that all
8102                                                             memory operations
8103                                                             to global have
8104                                                             completed before
8105                                                             performing the
8106                                                             atomicrmw that is
8107                                                             being released.
8108
8109                                                         2. flat_atomic
8110                                                         3. s_waitcnt vmcnt(0) &
8111                                                            lgkmcnt(0)
8112
8113                                                           - If TgSplit execution mode,
8114                                                             omit lgkmcnt(0).
8115                                                           - If OpenCL, omit
8116                                                             lgkmcnt(0).
8117                                                           - Must happen before
8118                                                             following
8119                                                             buffer_wbinvl1_vol.
8120                                                           - Ensures the
8121                                                             atomicrmw has
8122                                                             completed before
8123                                                             invalidating the
8124                                                             cache.
8125
8126                                                         4. buffer_wbinvl1_vol
8127
8128                                                           - Must happen before
8129                                                             any following
8130                                                             global/generic
8131                                                             load/load
8132                                                             atomic/atomicrmw.
8133                                                           - Ensures that
8134                                                             following loads
8135                                                             will not see stale
8136                                                             global data.
8137
8138     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
8139
8140                                                           - Must happen before
8141                                                             following s_waitcnt.
8142                                                           - Performs L2 writeback to
8143                                                             ensure previous
8144                                                             global/generic
8145                                                             store/atomicrmw are
8146                                                             visible at system scope.
8147
8148                                                         2. s_waitcnt lgkmcnt(0) &
8149                                                            vmcnt(0)
8150
8151                                                           - If TgSplit execution mode,
8152                                                             omit lgkmcnt(0).
8153                                                           - If OpenCL, omit
8154                                                             lgkmcnt(0).
8155                                                           - Could be split into
8156                                                             separate s_waitcnt
8157                                                             vmcnt(0) and
8158                                                             s_waitcnt
8159                                                             lgkmcnt(0) to allow
8160                                                             them to be
8161                                                             independently moved
8162                                                             according to the
8163                                                             following rules.
8164                                                           - s_waitcnt vmcnt(0)
8165                                                             must happen after
8166                                                             any preceding
8167                                                             global/generic
8168                                                             load/store/load
8169                                                             atomic/store
8170                                                             atomic/atomicrmw.
8171                                                           - s_waitcnt lgkmcnt(0)
8172                                                             must happen after
8173                                                             any preceding
8174                                                             local/generic
8175                                                             load/store/load
8176                                                             atomic/store
8177                                                             atomic/atomicrmw.
8178                                                           - Must happen before
8179                                                             the following
8180                                                             atomicrmw.
8181                                                           - Ensures that all
8182                                                             memory operations
8183                                                             to global and L2 writeback
8184                                                             have completed before
8185                                                             performing the
8186                                                             atomicrmw that is
8187                                                             being released.
8188
8189                                                         3. flat_atomic
8190                                                         4. s_waitcnt vmcnt(0) &
8191                                                            lgkmcnt(0)
8192
8193                                                           - If TgSplit execution mode,
8194                                                             omit lgkmcnt(0).
8195                                                           - If OpenCL, omit
8196                                                             lgkmcnt(0).
8197                                                           - Must happen before
8198                                                             following buffer_invl2 and
8199                                                             buffer_wbinvl1_vol.
8200                                                           - Ensures the
8201                                                             atomicrmw has
8202                                                             completed before
8203                                                             invalidating the
8204                                                             caches.
8205
8206                                                         5. buffer_invl2;
8207                                                            buffer_wbinvl1_vol
8208
8209                                                           - Must happen before
8210                                                             any following
8211                                                             global/generic
8212                                                             load/load
8213                                                             atomic/atomicrmw.
8214                                                           - Ensures that
8215                                                             following
8216                                                             loads will not see
8217                                                             stale L1 global data,
8218                                                             nor see stale L2 MTYPE
8219                                                             NC global data.
8220                                                             MTYPE RW and CC memory will
8221                                                             never be stale in L2 due to
8222                                                             the memory probes.
8223
8224     fence        acq_rel      - singlethread *none*     *none*
8225                               - wavefront
8226     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
8227
8228                                                           - Use lgkmcnt(0) if not
8229                                                             TgSplit execution mode
8230                                                             and vmcnt(0) if TgSplit
8231                                                             execution mode.
8232                                                           - If OpenCL and
8233                                                             address space is
8234                                                             not generic, omit
8235                                                             lgkmcnt(0).
8236                                                           - If OpenCL and
8237                                                             address space is
8238                                                             local, omit
8239                                                             vmcnt(0).
8240                                                           - However, since LLVM
8241                                                             currently has no
8242                                                             address space on
8243                                                             the fence, the
8244                                                             s_waitcnt must
8245                                                             conservatively
8246                                                             always be generated
8247                                                             (see comment for
8248                                                             previous fence).
8249                                                           - s_waitcnt vmcnt(0)
8250                                                             must happen after
8251                                                             any preceding
8252                                                             global/generic
8253                                                             load/store/
8254                                                             load atomic/store atomic/
8255                                                             atomicrmw.
8256                                                           - s_waitcnt lgkmcnt(0)
8257                                                             must happen after
8258                                                             any preceding
8259                                                             local/generic
8260                                                             load/load
8261                                                             atomic/store/store
8262                                                             atomic/atomicrmw.
8263                                                           - Must happen before
8264                                                             any following
8265                                                             global/generic
8266                                                             load/load
8267                                                             atomic/store/store
8268                                                             atomic/atomicrmw.
8269                                                           - Ensures that all
8270                                                             memory operations
8271                                                             have
8272                                                             completed before
8273                                                             performing any
8274                                                             following global
8275                                                             memory operations.
8276                                                           - Ensures that the
8277                                                             preceding
8278                                                             local/generic load
8279                                                             atomic/atomicrmw
8280                                                             with an equal or
8281                                                             wider sync scope
8282                                                             and memory ordering
8283                                                             stronger than
8284                                                             unordered (this is
8285                                                             termed the
8286                                                             acquire-fence-paired-atomic)
8287                                                             has completed
8288                                                             before following
8289                                                             global memory
8290                                                             operations. This
8291                                                             satisfies the
8292                                                             requirements of
8293                                                             acquire.
8294                                                           - Ensures that all
8295                                                             previous memory
8296                                                             operations have
8297                                                             completed before a
8298                                                             following
8299                                                             local/generic store
8300                                                             atomic/atomicrmw
8301                                                             with an equal or
8302                                                             wider sync scope
8303                                                             and memory ordering
8304                                                             stronger than
8305                                                             unordered (this is
8306                                                             termed the
8307                                                             release-fence-paired-atomic).
8308                                                             This satisfies the
8309                                                             requirements of
8310                                                             release.
8311                                                           - Must happen before
8312                                                             the following
8313                                                             buffer_wbinvl1_vol.
8314                                                           - Ensures that the
8315                                                             acquire-fence-paired
8316                                                             atomic has completed
8317                                                             before invalidating
8318                                                             the
8319                                                             cache. Therefore
8320                                                             any following
8321                                                             locations read must
8322                                                             be no older than
8323                                                             the value read by
8324                                                             the
8325                                                             acquire-fence-paired-atomic.
8326
8327                                                         2. buffer_wbinvl1_vol
8328
8329                                                           - If not TgSplit execution
8330                                                             mode, omit.
8331                                                           - Ensures that
8332                                                             following
8333                                                             loads will not see
8334                                                             stale data.
8335
8336     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8337                                                            vmcnt(0)
8338
8339                                                           - If TgSplit execution mode,
8340                                                             omit lgkmcnt(0).
8341                                                           - If OpenCL and
8342                                                             address space is
8343                                                             not generic, omit
8344                                                             lgkmcnt(0).
8345                                                           - However, since LLVM
8346                                                             currently has no
8347                                                             address space on
8348                                                             the fence, it must
8349                                                             conservatively
8350                                                             always be generated
8351                                                             (see comment for
8352                                                             previous fence).
8353                                                           - Could be split into
8354                                                             separate s_waitcnt
8355                                                             vmcnt(0) and
8356                                                             s_waitcnt
8357                                                             lgkmcnt(0) to allow
8358                                                             them to be
8359                                                             independently moved
8360                                                             according to the
8361                                                             following rules.
8362                                                           - s_waitcnt vmcnt(0)
8363                                                             must happen after
8364                                                             any preceding
8365                                                             global/generic
8366                                                             load/store/load
8367                                                             atomic/store
8368                                                             atomic/atomicrmw.
8369                                                           - s_waitcnt lgkmcnt(0)
8370                                                             must happen after
8371                                                             any preceding
8372                                                             local/generic
8373                                                             load/store/load
8374                                                             atomic/store
8375                                                             atomic/atomicrmw.
8376                                                           - Must happen before
8377                                                             the following
8378                                                             buffer_wbinvl1_vol.
8379                                                           - Ensures that the
8380                                                             preceding
8381                                                             global/local/generic
8382                                                             load
8383                                                             atomic/atomicrmw
8384                                                             with an equal or
8385                                                             wider sync scope
8386                                                             and memory ordering
8387                                                             stronger than
8388                                                             unordered (this is
8389                                                             termed the
8390                                                             acquire-fence-paired-atomic)
8391                                                             has completed
8392                                                             before invalidating
8393                                                             the cache. This
8394                                                             satisfies the
8395                                                             requirements of
8396                                                             acquire.
8397                                                           - Ensures that all
8398                                                             previous memory
8399                                                             operations have
8400                                                             completed before a
8401                                                             following
8402                                                             global/local/generic
8403                                                             store
8404                                                             atomic/atomicrmw
8405                                                             with an equal or
8406                                                             wider sync scope
8407                                                             and memory ordering
8408                                                             stronger than
8409                                                             unordered (this is
8410                                                             termed the
8411                                                             release-fence-paired-atomic).
8412                                                             This satisfies the
8413                                                             requirements of
8414                                                             release.
8415
8416                                                         2. buffer_wbinvl1_vol
8417
8418                                                           - Must happen before
8419                                                             any following
8420                                                             global/generic
8421                                                             load/load
8422                                                             atomic/store/store
8423                                                             atomic/atomicrmw.
8424                                                           - Ensures that
8425                                                             following loads
8426                                                             will not see stale
8427                                                             global data. This
8428                                                             satisfies the
8429                                                             requirements of
8430                                                             acquire.
8431
8432     fence        acq_rel      - system       *none*     1. buffer_wbl2
8433
8434                                                           - If OpenCL and
8435                                                             address space is
8436                                                             local, omit.
8437                                                           - Must happen before
8438                                                             following s_waitcnt.
8439                                                           - Performs L2 writeback to
8440                                                             ensure previous
8441                                                             global/generic
8442                                                             store/atomicrmw are
8443                                                             visible at system scope.
8444
8445                                                         2. s_waitcnt lgkmcnt(0) &
8446                                                            vmcnt(0)
8447
8448                                                           - If TgSplit execution mode,
8449                                                             omit lgkmcnt(0).
8450                                                           - If OpenCL and
8451                                                             address space is
8452                                                             not generic, omit
8453                                                             lgkmcnt(0).
8454                                                           - However, since LLVM
8455                                                             currently has no
8456                                                             address space on
8457                                                             the fence, it must
8458                                                             conservatively
8459                                                             always be generated
8460                                                             (see comment for
8461                                                             previous fence).
8462                                                           - Could be split into
8463                                                             separate s_waitcnt
8464                                                             vmcnt(0) and
8465                                                             s_waitcnt
8466                                                             lgkmcnt(0) to allow
8467                                                             them to be
8468                                                             independently moved
8469                                                             according to the
8470                                                             following rules.
8471                                                           - s_waitcnt vmcnt(0)
8472                                                             must happen after
8473                                                             any preceding
8474                                                             global/generic
8475                                                             load/store/load
8476                                                             atomic/store
8477                                                             atomic/atomicrmw.
8478                                                           - s_waitcnt lgkmcnt(0)
8479                                                             must happen after
8480                                                             any preceding
8481                                                             local/generic
8482                                                             load/store/load
8483                                                             atomic/store
8484                                                             atomic/atomicrmw.
8485                                                           - Must happen before
8486                                                             the following buffer_invl2 and
8487                                                             buffer_wbinvl1_vol.
8488                                                           - Ensures that the
8489                                                             preceding
8490                                                             global/local/generic
8491                                                             load
8492                                                             atomic/atomicrmw
8493                                                             with an equal or
8494                                                             wider sync scope
8495                                                             and memory ordering
8496                                                             stronger than
8497                                                             unordered (this is
8498                                                             termed the
8499                                                             acquire-fence-paired-atomic)
8500                                                             has completed
8501                                                             before invalidating
8502                                                             the cache. This
8503                                                             satisfies the
8504                                                             requirements of
8505                                                             acquire.
8506                                                           - Ensures that all
8507                                                             previous memory
8508                                                             operations have
8509                                                             completed before a
8510                                                             following
8511                                                             global/local/generic
8512                                                             store
8513                                                             atomic/atomicrmw
8514                                                             with an equal or
8515                                                             wider sync scope
8516                                                             and memory ordering
8517                                                             stronger than
8518                                                             unordered (this is
8519                                                             termed the
8520                                                             release-fence-paired-atomic).
8521                                                             This satisfies the
8522                                                             requirements of
8523                                                             release.
8524
8525                                                         3. buffer_invl2;
8526                                                            buffer_wbinvl1_vol
8527
8528                                                           - Must happen before
8529                                                             any following
8530                                                             global/generic
8531                                                             load/load
8532                                                             atomic/store/store
8533                                                             atomic/atomicrmw.
8534                                                           - Ensures that
8535                                                             following
8536                                                             loads will not see
8537                                                             stale L1 global data,
8538                                                             nor see stale L2 MTYPE
8539                                                             NC global data.
8540                                                             MTYPE RW and CC memory will
8541                                                             never be stale in L2 due to
8542                                                             the memory probes.
8543
8544     **Sequentially Consistent Atomic**
8545     ------------------------------------------------------------------------------------
8546     load atomic  seq_cst      - singlethread - global   *Same as corresponding
8547                               - wavefront    - local    load atomic acquire,
8548                                              - generic  except must generate
8549                                                         all instructions even
8550                                                         for OpenCL.*
8551     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8552                                              - generic
8553                                                           - Use lgkmcnt(0) if not
8554                                                             TgSplit execution mode
8555                                                             and vmcnt(0) if TgSplit
8556                                                             execution mode.
8557                                                           - s_waitcnt lgkmcnt(0) must
8558                                                             happen after
8559                                                             preceding
8560                                                             local/generic load
8561                                                             atomic/store
8562                                                             atomic/atomicrmw
8563                                                             with memory
8564                                                             ordering of seq_cst
8565                                                             and with equal or
8566                                                             wider sync scope.
8567                                                             (Note that seq_cst
8568                                                             fences have their
8569                                                             own s_waitcnt
8570                                                             lgkmcnt(0) and so do
8571                                                             not need to be
8572                                                             considered.)
8573                                                           - s_waitcnt vmcnt(0)
8574                                                             must happen after
8575                                                             preceding
8576                                                             global/generic load
8577                                                             atomic/store
8578                                                             atomic/atomicrmw
8579                                                             with memory
8580                                                             ordering of seq_cst
8581                                                             and with equal or
8582                                                             wider sync scope.
8583                                                             (Note that seq_cst
8584                                                             fences have their
8585                                                             own s_waitcnt
8586                                                             vmcnt(0) and so do
8587                                                             not need to be
8588                                                             considered.)
8589                                                           - Ensures any
8590                                                             preceding
8591                                                             sequentially
8592                                                             consistent global/local
8593                                                             memory instructions
8594                                                             have completed
8595                                                             before executing
8596                                                             this sequentially
8597                                                             consistent
8598                                                             instruction. This
8599                                                             prevents reordering
8600                                                             a seq_cst store
8601                                                             followed by a
8602                                                             seq_cst load. (Note
8603                                                             that seq_cst is
8604                                                             stronger than
8605                                                             acquire/release as
8606                                                             the reordering of
8607                                                             load acquire
8608                                                             followed by a store
8609                                                             release is
8610                                                             prevented by the
8611                                                             s_waitcnt of
8612                                                             the release, but
8613                                                             there is nothing
8614                                                             preventing a store
8615                                                             release followed by
8616                                                             load acquire from
8617                                                             completing out of
8618                                                             order. The s_waitcnt
8619                                                             could be placed after
8620                                                             seq_store or before
8621                                                             the seq_load. We
8622                                                             choose the load to
8623                                                             make the s_waitcnt be
8624                                                             as late as possible
8625                                                             so that the store
8626                                                             may have already
8627                                                             completed.)
8628
8629                                                         2. *Following
8630                                                            instructions same as
8631                                                            corresponding load
8632                                                            atomic acquire,
8633                                                            except must generate
8634                                                            all instructions even
8635                                                            for OpenCL.*
8636     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8637                                                         local address space cannot
8638                                                         be used.*
8639
8640                                                         *Same as corresponding
8641                                                         load atomic acquire,
8642                                                         except must generate
8643                                                         all instructions even
8644                                                         for OpenCL.*
8645
8646     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8647                               - system       - generic     vmcnt(0)
8648
8649                                                           - If TgSplit execution mode,
8650                                                             omit lgkmcnt(0).
8651                                                           - Could be split into
8652                                                             separate s_waitcnt
8653                                                             vmcnt(0)
8654                                                             and s_waitcnt
8655                                                             lgkmcnt(0) to allow
8656                                                             them to be
8657                                                             independently moved
8658                                                             according to the
8659                                                             following rules.
8660                                                           - s_waitcnt lgkmcnt(0)
8661                                                             must happen after
8662                                                             preceding
8663                                                             local/generic load
8664                                                             atomic/store
8665                                                             atomic/atomicrmw
8666                                                             with memory
8667                                                             ordering of seq_cst
8668                                                             and with equal or
8669                                                             wider sync scope.
8670                                                             (Note that seq_cst
8671                                                             fences have their
8672                                                             own s_waitcnt
8673                                                             lgkmcnt(0) and so do
8674                                                             not need to be
8675                                                             considered.)
8676                                                           - s_waitcnt vmcnt(0)
8677                                                             must happen after
8678                                                             preceding
8679                                                             global/generic load
8680                                                             atomic/store
8681                                                             atomic/atomicrmw
8682                                                             with memory
8683                                                             ordering of seq_cst
8684                                                             and with equal or
8685                                                             wider sync scope.
8686                                                             (Note that seq_cst
8687                                                             fences have their
8688                                                             own s_waitcnt
8689                                                             vmcnt(0) and so do
8690                                                             not need to be
8691                                                             considered.)
8692                                                           - Ensures any
8693                                                             preceding
                                                             sequentially
8695                                                             consistent global
8696                                                             memory instructions
8697                                                             have completed
8698                                                             before executing
8699                                                             this sequentially
8700                                                             consistent
8701                                                             instruction. This
8702                                                             prevents reordering
8703                                                             a seq_cst store
8704                                                             followed by a
8705                                                             seq_cst load. (Note
8706                                                             that seq_cst is
8707                                                             stronger than
8708                                                             acquire/release as
8709                                                             the reordering of
8710                                                             load acquire
8711                                                             followed by a store
8712                                                             release is
8713                                                             prevented by the
8714                                                             s_waitcnt of
8715                                                             the release, but
8716                                                             there is nothing
8717                                                             preventing a store
8718                                                             release followed by
8719                                                             load acquire from
8720                                                             completing out of
8721                                                             order. The s_waitcnt
8722                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We choose the
                                                             load to
8726                                                             make the s_waitcnt be
8727                                                             as late as possible
8728                                                             so that the store
8729                                                             may have already
8730                                                             completed.)
8731
8732                                                         2. *Following
8733                                                            instructions same as
8734                                                            corresponding load
8735                                                            atomic acquire,
8736                                                            except must generate
8737                                                            all instructions even
8738                                                            for OpenCL.*
8739     store atomic seq_cst      - singlethread - global   *Same as corresponding
8740                               - wavefront    - local    store atomic release,
8741                               - workgroup    - generic  except must generate
8742                               - agent                   all instructions even
8743                               - system                  for OpenCL.*
8744     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8745                               - wavefront    - local    atomicrmw acq_rel,
8746                               - workgroup    - generic  except must generate
8747                               - agent                   all instructions even
8748                               - system                  for OpenCL.*
8749     fence        seq_cst      - singlethread *none*     *Same as corresponding
8750                               - wavefront               fence acq_rel,
8751                               - workgroup               except must generate
8752                               - agent                   all instructions even
8753                               - system                  for OpenCL.*
8754     ============ ============ ============== ========== ================================
8755
8756.. _amdgpu-amdhsa-memory-model-gfx940:
8757
8758Memory Model GFX940
8759+++++++++++++++++++
8760
8761For GFX940:
8762
8763* Each agent has multiple shader arrays (SA).
8764* Each SA has multiple compute units (CU).
8765* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in which
  the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated because wavefronts of the same work-group can be in different CUs.
8772* All LDS operations of a CU are performed as wavefront wide operations in a
8773  global order and involve no caching. Completion is reported to a wavefront in
8774  execution order.
8775* The LDS memory has multiple request queues shared by the SIMDs of a
8776  CU. Therefore, the LDS operations performed by different wavefronts of a
8777  work-group can be reordered relative to each other, which can result in
8778  reordering the visibility of vector memory operations with respect to LDS
8779  operations of other wavefronts in the same work-group. A ``s_waitcnt
8780  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8781  vector memory operations between wavefronts of a work-group, but not between
8782  operations performed by the same wavefront.
8783* The vector memory operations are performed as wavefront wide operations and
8784  completion is reported to a wavefront in execution order. The exception is
8785  that ``flat_load/store/atomic`` instructions can report out of vector memory
8786  order if they access LDS memory, and out of LDS operation order if they access
8787  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:
8790
8791  * No special action is required for coherence between the lanes of a single
8792    wavefront.
8793
  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is tgsplit
    execution mode: wavefronts of the same work-group can be in different CUs,
    so a ``buffer_inv sc0`` is required to invalidate the L1 cache.
8799
8800  * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
8801    between wavefronts executing in different work-groups as they may be
8802    executing on different CUs.
8803
8804  * Atomic read-modify-write instructions implicitly bypass the L1 cache.
8805    Therefore, they do not use the sc0 bit for coherence and instead use it to
8806    indicate if the instruction returns the original value being updated. They
8807    do use sc1 to indicate system or agent scope coherence.
8808
8809* The scalar memory operations access a scalar L1 cache shared by all wavefronts
8810  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
8811  scalar operations are used in a restricted way so do not impact the memory
8812  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
8813* The vector and scalar memory operations use an L2 cache.
8814
8815  * The gfx940 can be configured as a number of smaller agents with each having
8816    a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
8817    larger agents with groups of CUs on each agent each sharing separate L2
8818    caches.
8819  * The L2 cache has independent channels to service disjoint ranges of virtual
8820    addresses.
8821  * Each CU has a separate request queue per channel for its associated L2.
8822    Therefore, the vector and scalar memory operations performed by wavefronts
8823    executing with different L1 caches and the same L2 cache can be reordered
8824    relative to each other.
8825  * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
8826    vector memory operations of different CUs. It ensures a previous vector
8827    memory operation has completed before executing a subsequent vector memory
8828    or LDS operation and so can be used to meet the requirements of acquire and
8829    release.
8830  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
8831    (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
8832    the PTE C-bit set for memory not local to the L2.
8833
8834    * Any local memory cache lines will be automatically invalidated by writes
8835      from CUs associated with other L2 caches, or writes from the CPU, due to
8836      the cache probe caused by the PTE C-bit.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent accesses from the GPU will automatically invalidate or write
      back the CPU cache due to the L2 probe filter.
    * To ensure coherence of local memory writes of CUs with different L1 caches
      in the same agent, a ``buffer_wbl2`` is required. It does nothing if the
      agent is configured to have a single L2, or writes back dirty L2 cache
      lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory writes of CUs in different agents, a
      ``buffer_wbl2 sc1`` is required. It writes back dirty L2 cache lines.
    * To ensure coherence of local memory reads of CUs with different L1 caches
      in the same agent, a ``buffer_inv sc1`` is required. It does nothing if
      the agent is configured to have a single L2, or invalidates non-local L2
      cache lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory reads of CUs in different agents, a
      ``buffer_inv sc0 sc1`` is required. It invalidates non-local L2 cache
      lines if configured to have multiple L2 caches.
8853
8854  * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
8855    UC (uncached) which bypasses the L2.
8856
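The invalidation requirements in the list above can be summarized per sync
scope. The Python sketch below is an illustration only. ``acquire_invalidate``
is a hypothetical helper (not an LLVM API) and it assumes the scope-to-instruction
mapping used by the acquire rows of the GFX940 code sequence table: nothing
for a single wavefront, ``buffer_inv sc0`` for a work-group only in tgsplit
mode, ``buffer_inv sc1`` for agent scope, and ``buffer_inv sc0 sc1`` for
system scope.

```python
from typing import Optional

# Sync scopes in increasing order of visibility.
SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


def acquire_invalidate(scope: str, tgsplit: bool = False) -> Optional[str]:
    """Return the buffer_inv an acquire at `scope` needs, or None.

    Hypothetical helper for illustration; not part of LLVM.
    """
    if scope not in SCOPES:
        raise ValueError(f"unknown sync scope: {scope}")
    if scope in ("singlethread", "wavefront"):
        # Lanes of one wavefront share the same L1: no action required.
        return None
    if scope == "workgroup":
        # A work-group runs on one CU unless tgsplit spreads it over several.
        return "buffer_inv sc0" if tgsplit else None
    if scope == "agent":
        # Other CUs of the agent may use a different L1 (and possibly L2).
        return "buffer_inv sc1"
    # System scope: data may have been written by a different agent.
    return "buffer_inv sc0 sc1"
```

For example, ``acquire_invalidate("workgroup")`` returns ``None``, while the
same call with ``tgsplit=True`` returns ``"buffer_inv sc0"``. The sketch covers
only the invalidate side; the matching ``buffer_wbl2`` writebacks on the
release side are not modeled.
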
8857Scalar memory operations are only used to access memory that is proven to not
8858change during the execution of the kernel dispatch. This includes constant
8859address space and global address space for program scope ``const`` variables.
8860Therefore, the kernel machine code does not have to maintain the scalar cache to
8861ensure it is coherent with the vector caches. The scalar and vector caches are
8862invalidated between kernel dispatches by CP since constant address space data
8863may change between kernel dispatch executions. See
8864:ref:`amdgpu-amdhsa-memory-spaces`.
8865
8866The one exception is if scalar writes are used to spill SGPR registers. In this
8867case the AMDGPU backend ensures the memory location used to spill is never
8868accessed by vector memory operations at the same time. If scalar writes are used
8869then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8870return since the locations may be used for vector memory instructions by a
8871future wavefront that uses the same scratch area, or a function call that
8872creates a frame at the same address, respectively. There is no need for a
8873``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8874
8875For kernarg backing memory:
8876
8877* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile, so it is not
  invalidated by ``*_vol``.
8882* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8883  so the L2 cache will be coherent with the CPU and other agents.
8884
8885Scratch backing memory (which is used for the private address space) is accessed
8886with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
8887only accessed by a single thread, and is always write-before-read, there is
8888never a need to invalidate these entries from the L1 cache. Hence all cache
8889invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
8890
8891The code sequences used to implement the memory model for GFX940 are defined
8892in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
8893
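As a reading aid for the monotonic atomic rows of the table that follows, the
Python sketch below transcribes which ``sc0``/``sc1`` bits those rows attach to
global and generic accesses at each sync scope. The names ``monotonic_bits``,
``_LOAD_STORE_BITS`` and ``_ATOMICRMW_BITS`` are hypothetical and exist only
for this illustration; they are not part of LLVM.

```python
# Bits for monotonic load atomic / store atomic on global or generic memory.
_LOAD_STORE_BITS = {
    "singlethread": "",
    "wavefront": "",
    "workgroup": "sc0=1",
    "agent": "sc1=1",
    "system": "sc0=1 sc1=1",
}

# Bits for monotonic atomicrmw on global or generic memory. Only system
# scope sets a bit here; sc0 instead selects whether the old value is returned.
_ATOMICRMW_BITS = {
    "singlethread": "",
    "wavefront": "",
    "workgroup": "",
    "agent": "",
    "system": "sc1=1",
}


def monotonic_bits(instr: str, scope: str) -> str:
    """Return the sc0/sc1 bits for a monotonic atomic (illustrative only)."""
    if instr not in ("load", "store", "atomicrmw"):
        raise ValueError(f"unknown instruction kind: {instr}")
    table = _ATOMICRMW_BITS if instr == "atomicrmw" else _LOAD_STORE_BITS
    if scope not in table:
        raise ValueError(f"unknown sync scope: {scope}")
    return table[scope]
```

For example, ``monotonic_bits("load", "system")`` yields ``"sc0=1 sc1=1"``,
whereas ``monotonic_bits("atomicrmw", "agent")`` yields an empty string. The
local address space is not modeled because the ``ds_*`` instructions in those
rows take no scope bits.
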
8894  .. table:: AMDHSA Memory Model Code Sequences GFX940
8895     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
8896
8897     ============ ============ ============== ========== ================================
8898     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
8899                  Ordering     Sync Scope     Address    GFX940
8900                                              Space
8901     ============ ============ ============== ========== ================================
8902     **Non-Atomic**
8903     ------------------------------------------------------------------------------------
8904     load         *none*       *none*         - global   - !volatile & !nontemporal
8905                                              - generic
8906                                              - private    1. buffer/global/flat_load
8907                                              - constant
8908                                                         - !volatile & nontemporal
8909
8910                                                           1. buffer/global/flat_load
8911                                                              nt=1
8912
8913                                                         - volatile
8914
8915                                                           1. buffer/global/flat_load
8916                                                              sc0=1 sc1=1
8917                                                           2. s_waitcnt vmcnt(0)
8918
8919                                                            - Must happen before
8920                                                              any following volatile
8921                                                              global/generic
8922                                                              load/store.
8923                                                            - Ensures that
8924                                                              volatile
8925                                                              operations to
8926                                                              different
8927                                                              addresses will not
8928                                                              be reordered by
8929                                                              hardware.
8930
8931     load         *none*       *none*         - local    1. ds_load
8932     store        *none*       *none*         - global   - !volatile & !nontemporal
8933                                              - generic
8934                                              - private    1. buffer/global/flat_store
8935                                              - constant
8936                                                         - !volatile & nontemporal
8937
8938                                                           1. buffer/global/flat_store
8939                                                              nt=1
8940
8941                                                         - volatile
8942
8943                                                           1. buffer/global/flat_store
8944                                                              sc0=1 sc1=1
8945                                                           2. s_waitcnt vmcnt(0)
8946
8947                                                            - Must happen before
8948                                                              any following volatile
8949                                                              global/generic
8950                                                              load/store.
8951                                                            - Ensures that
8952                                                              volatile
8953                                                              operations to
8954                                                              different
8955                                                              addresses will not
8956                                                              be reordered by
8957                                                              hardware.
8958
8959     store        *none*       *none*         - local    1. ds_store
8960     **Unordered Atomic**
8961     ------------------------------------------------------------------------------------
8962     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
8963     store atomic unordered    *any*          *any*      *Same as non-atomic*.
8964     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
8965     **Monotonic Atomic**
8966     ------------------------------------------------------------------------------------
8967     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
8968                               - wavefront    - generic
8969     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
8970                                              - generic     sc0=1
8971     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
8972                               - wavefront               local address space cannot
8973                               - workgroup               be used.*
8974
8975                                                         1. ds_load
8976     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
8977                                              - generic     sc1=1
8978     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
8979                                              - generic     sc0=1 sc1=1
8980     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
8981                               - wavefront    - generic
8982     store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
8983                                              - generic     sc0=1
8984     store atomic monotonic    - agent        - global   1. buffer/global/flat_store
8985                                              - generic     sc1=1
8986     store atomic monotonic    - system       - global   1. buffer/global/flat_store
8987                                              - generic     sc0=1 sc1=1
8988     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
8989                               - wavefront               local address space cannot
8990                               - workgroup               be used.*
8991
8992                                                         1. ds_store
8993     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
8994                               - wavefront    - generic
8995                               - workgroup
8996                               - agent
8997     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
8998                                              - generic     sc1=1
8999     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
9000                               - wavefront               local address space cannot
9001                               - workgroup               be used.*
9002
9003                                                         1. ds_atomic
9004     **Acquire Atomic**
9005     ------------------------------------------------------------------------------------
9006     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
9007                               - wavefront    - local
9008                                              - generic
9009     load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
9010                                                         2. s_waitcnt vmcnt(0)
9011
9012                                                           - If not TgSplit execution
9013                                                             mode, omit.
9014                                                           - Must happen before the
9015                                                             following buffer_inv.
9016
9017                                                         3. buffer_inv sc0=1
9018
9019                                                           - If not TgSplit execution
9020                                                             mode, omit.
9021                                                           - Must happen before
9022                                                             any following
9023                                                             global/generic
9024                                                             load/load
9025                                                             atomic/store/store
9026                                                             atomic/atomicrmw.
9027                                                           - Ensures that
9028                                                             following
9029                                                             loads will not see
9030                                                             stale data.
9031
9032     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
9033                                                         local address space cannot
9034                                                         be used.*
9035
9036                                                         1. ds_load
9037                                                         2. s_waitcnt lgkmcnt(0)
9038
9039                                                           - If OpenCL, omit.
9040                                                           - Must happen before
9041                                                             any following
9042                                                             global/generic
9043                                                             load/load
9044                                                             atomic/store/store
9045                                                             atomic/atomicrmw.
9046                                                           - Ensures any
9047                                                             following global
9048                                                             data read is no
9049                                                             older than the local load
9050                                                             atomic value being
9051                                                             acquired.
9052
     load atomic  acquire      - workgroup    - generic  1. flat_load sc0=1
9054                                                         2. s_waitcnt lgkm/vmcnt(0)
9055
9056                                                           - Use lgkmcnt(0) if not
9057                                                             TgSplit execution mode
9058                                                             and vmcnt(0) if TgSplit
9059                                                             execution mode.
9060                                                           - If OpenCL, omit lgkmcnt(0).
9061                                                           - Must happen before
9062                                                             the following
9063                                                             buffer_inv and any
9064                                                             following global/generic
9065                                                             load/load
9066                                                             atomic/store/store
9067                                                             atomic/atomicrmw.
9068                                                           - Ensures any
9069                                                             following global
9070                                                             data read is no
9071                                                             older than a local load
9072                                                             atomic value being
9073                                                             acquired.
9074
9075                                                         3. buffer_inv sc0=1
9076
9077                                                           - If not TgSplit execution
9078                                                             mode, omit.
9079                                                           - Ensures that
9080                                                             following
9081                                                             loads will not see
9082                                                             stale data.
9083
9084     load atomic  acquire      - agent        - global   1. buffer/global_load
9085                                                            sc1=1
9086                                                         2. s_waitcnt vmcnt(0)
9087
9088                                                           - Must happen before
9089                                                             following
9090                                                             buffer_inv.
9091                                                           - Ensures the load
9092                                                             has completed
9093                                                             before invalidating
9094                                                             the cache.
9095
9096                                                         3. buffer_inv sc1=1
9097
9098                                                           - Must happen before
9099                                                             any following
9100                                                             global/generic
9101                                                             load/load
9102                                                             atomic/atomicrmw.
9103                                                           - Ensures that
9104                                                             following
9105                                                             loads will not see
9106                                                             stale global data.
9107
9108     load atomic  acquire      - system       - global   1. buffer/global/flat_load
9109                                                            sc0=1 sc1=1
9110                                                         2. s_waitcnt vmcnt(0)
9111
9112                                                           - Must happen before
9113                                                             following
9114                                                             buffer_inv.
9115                                                           - Ensures the load
9116                                                             has completed
9117                                                             before invalidating
9118                                                             the cache.
9119
9120                                                         3. buffer_inv sc0=1 sc1=1
9121
9122                                                           - Must happen before
9123                                                             any following
9124                                                             global/generic
9125                                                             load/load
9126                                                             atomic/atomicrmw.
9127                                                           - Ensures that
9128                                                             following
9129                                                             loads will not see
9130                                                             stale MTYPE NC global data.
9131                                                             MTYPE RW and CC memory will
9132                                                             never be stale due to the
9133                                                             memory probes.
9134
9135     load atomic  acquire      - agent        - generic  1. flat_load sc1=1
9136                                                         2. s_waitcnt vmcnt(0) &
9137                                                            lgkmcnt(0)
9138
9139                                                           - If TgSplit execution mode,
9140                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
9142                                                             lgkmcnt(0).
9143                                                           - Must happen before
9144                                                             following
9145                                                             buffer_inv.
9146                                                           - Ensures the flat_load
9147                                                             has completed
9148                                                             before invalidating
9149                                                             the cache.
9150
9151                                                         3. buffer_inv sc1=1
9152
9153                                                           - Must happen before
9154                                                             any following
9155                                                             global/generic
9156                                                             load/load
9157                                                             atomic/atomicrmw.
9158                                                           - Ensures that
9159                                                             following loads
9160                                                             will not see stale
9161                                                             global data.
9162
9163     load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
9164                                                         2. s_waitcnt vmcnt(0) &
9165                                                            lgkmcnt(0)
9166
9167                                                           - If TgSplit execution mode,
9168                                                             omit lgkmcnt(0).
9169                                                           - If OpenCL, omit
9170                                                             lgkmcnt(0).
9171                                                           - Must happen before
9172                                                             the following
9173                                                             buffer_inv.
9174                                                           - Ensures the flat_load
9175                                                             has completed
9176                                                             before invalidating
9177                                                             the caches.
9178
9179                                                         3. buffer_inv sc0=1 sc1=1
9180
9181                                                           - Must happen before
9182                                                             any following
9183                                                             global/generic
9184                                                             load/load
9185                                                             atomic/atomicrmw.
9186                                                           - Ensures that
9187                                                             following
9188                                                             loads will not see
9189                                                             stale MTYPE NC global data.
9190                                                             MTYPE RW and CC memory will
9191                                                             never be stale due to the
9192                                                             memory probes.
9193
9194     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
9195                               - wavefront    - generic
9196     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
9197                               - wavefront               local address space cannot
9198                                                         be used.*
9199
9200                                                         1. ds_atomic
9201     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
9202                                                         2. s_waitcnt vmcnt(0)
9203
9204                                                           - If not TgSplit execution
9205                                                             mode, omit.
9206                                                           - Must happen before the
9207                                                             following buffer_inv.
9208                                                           - Ensures the atomicrmw
9209                                                             has completed
9210                                                             before invalidating
9211                                                             the cache.
9212
9213                                                         3. buffer_inv sc0=1
9214
9215                                                           - If not TgSplit execution
9216                                                             mode, omit.
9217                                                           - Must happen before
9218                                                             any following
9219                                                             global/generic
9220                                                             load/load
9221                                                             atomic/atomicrmw.
9222                                                           - Ensures that
9223                                                             following loads
9224                                                             will not see stale
9225                                                             global data.
9226
9227     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
9228                                                         local address space cannot
9229                                                         be used.*
9230
9231                                                         1. ds_atomic
9232                                                         2. s_waitcnt lgkmcnt(0)
9233
9234                                                           - If OpenCL, omit.
9235                                                           - Must happen before
9236                                                             any following
9237                                                             global/generic
9238                                                             load/load
9239                                                             atomic/store/store
9240                                                             atomic/atomicrmw.
9241                                                           - Ensures any
9242                                                             following global
9243                                                             data read is no
9244                                                             older than the local
9245                                                             atomicrmw value
9246                                                             being acquired.
9247
9248     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
9249                                                         2. s_waitcnt lgkm/vmcnt(0)
9250
9251                                                           - Use lgkmcnt(0) if not
9252                                                             TgSplit execution mode
9253                                                             and vmcnt(0) if TgSplit
9254                                                             execution mode.
9255                                                           - If OpenCL, omit lgkmcnt(0).
9256                                                           - Must happen before
9257                                                             the following
9258                                                             buffer_inv and
9259                                                             any following
9260                                                             global/generic
9261                                                             load/load
9262                                                             atomic/store/store
9263                                                             atomic/atomicrmw.
9264                                                           - Ensures any
9265                                                             following global
9266                                                             data read is no
9267                                                             older than a local
9268                                                             atomicrmw value
9269                                                             being acquired.
9270
9271                                                         3. buffer_inv sc0=1
9272
9273                                                           - If not TgSplit execution
9274                                                             mode, omit.
9275                                                           - Ensures that
9276                                                             following
9277                                                             loads will not see
9278                                                             stale data.
9279
9280     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
9281                                                         2. s_waitcnt vmcnt(0)
9282
9283                                                           - Must happen before
9284                                                             the following
9285                                                             buffer_inv.
9286                                                           - Ensures the
9287                                                             atomicrmw has
9288                                                             completed before
9289                                                             invalidating the
9290                                                             cache.
9291
9292                                                         3. buffer_inv sc1=1
9293
9294                                                           - Must happen before
9295                                                             any following
9296                                                             global/generic
9297                                                             load/load
9298                                                             atomic/atomicrmw.
9299                                                           - Ensures that
9300                                                             following loads
9301                                                             will not see stale
9302                                                             global data.
9303
9304     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
9305                                                            sc1=1
9306                                                         2. s_waitcnt vmcnt(0)
9307
9308                                                           - Must happen before
9309                                                             the following
9310                                                             buffer_inv.
9311                                                           - Ensures the
9312                                                             atomicrmw has
9313                                                             completed before
9314                                                             invalidating the
9315                                                             caches.
9316
9317                                                         3. buffer_inv sc0=1 sc1=1
9318
9319                                                           - Must happen before
9320                                                             any following
9321                                                             global/generic
9322                                                             load/load
9323                                                             atomic/atomicrmw.
9324                                                           - Ensures that
9325                                                             following
9326                                                             loads will not see
9327                                                             stale MTYPE NC global data.
9328                                                             MTYPE RW and CC memory will
9329                                                             never be stale due to the
9330                                                             memory probes.
9331
9332     atomicrmw    acquire      - agent        - generic  1. flat_atomic
9333                                                         2. s_waitcnt vmcnt(0) &
9334                                                            lgkmcnt(0)
9335
9336                                                           - If TgSplit execution mode,
9337                                                             omit lgkmcnt(0).
9338                                                           - If OpenCL, omit
9339                                                             lgkmcnt(0).
9340                                                           - Must happen before
9341                                                             the following
9342                                                             buffer_inv.
9343                                                           - Ensures the
9344                                                             atomicrmw has
9345                                                             completed before
9346                                                             invalidating the
9347                                                             cache.
9348
9349                                                         3. buffer_inv sc1=1
9350
9351                                                           - Must happen before
9352                                                             any following
9353                                                             global/generic
9354                                                             load/load
9355                                                             atomic/atomicrmw.
9356                                                           - Ensures that
9357                                                             following loads
9358                                                             will not see stale
9359                                                             global data.
9360
9361     atomicrmw    acquire      - system       - generic  1. flat_atomic sc1=1
9362                                                         2. s_waitcnt vmcnt(0) &
9363                                                            lgkmcnt(0)
9364
9365                                                           - If TgSplit execution mode,
9366                                                             omit lgkmcnt(0).
9367                                                           - If OpenCL, omit
9368                                                             lgkmcnt(0).
9369                                                           - Must happen before
9370                                                             the following
9371                                                             buffer_inv.
9372                                                           - Ensures the
9373                                                             atomicrmw has
9374                                                             completed before
9375                                                             invalidating the
9376                                                             caches.
9377
9378                                                         3. buffer_inv sc0=1 sc1=1
9379
9380                                                           - Must happen before
9381                                                             any following
9382                                                             global/generic
9383                                                             load/load
9384                                                             atomic/atomicrmw.
9385                                                           - Ensures that
9386                                                             following
9387                                                             loads will not see
9388                                                             stale MTYPE NC global data.
9389                                                             MTYPE RW and CC memory will
9390                                                             never be stale due to the
9391                                                             memory probes.
9392
9393     fence        acquire      - singlethread *none*     *none*
9394                               - wavefront
9395     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9396
9397                                                           - Use lgkmcnt(0) if not
9398                                                             TgSplit execution mode
9399                                                             and vmcnt(0) if TgSplit
9400                                                             execution mode.
9401                                                           - If OpenCL and
9402                                                             address space is
9403                                                             not generic, omit
9404                                                             lgkmcnt(0).
9405                                                           - If OpenCL and
9406                                                             address space is
9407                                                             local, omit
9408                                                             vmcnt(0).
9409                                                           - However, since LLVM
9410                                                             currently has no
9411                                                             address space on
9412                                                             the fence, the wait
9413                                                             must conservatively
9414                                                             always be generated.
9415                                                             If the fence had an
9416                                                             address space, it
9417                                                             would be set to the
9418                                                             address space of the
9419                                                             OpenCL fence flag, or
9420                                                             to generic if both
9421                                                             local and global
9422                                                             flags are
9423                                                             specified.
9424                                                           - s_waitcnt vmcnt(0)
9425                                                             must happen after
9426                                                             any preceding
9427                                                             global/generic load
9428                                                             atomic/
9429                                                             atomicrmw
9430                                                             with an equal or
9431                                                             wider sync scope
9432                                                             and memory ordering
9433                                                             stronger than
9434                                                             unordered (this is
9435                                                             termed the
9436                                                             fence-paired-atomic).
9437                                                           - s_waitcnt lgkmcnt(0)
9438                                                             must happen after
9439                                                             any preceding
9440                                                             local/generic load
9441                                                             atomic/atomicrmw
9442                                                             with an equal or
9443                                                             wider sync scope
9444                                                             and memory ordering
9445                                                             stronger than
9446                                                             unordered (this is
9447                                                             termed the
9448                                                             fence-paired-atomic).
9449                                                           - Must happen before
9450                                                             the following
9451                                                             buffer_inv and
9452                                                             any following
9453                                                             global/generic
9454                                                             load/load
9455                                                             atomic/store/store
9456                                                             atomic/atomicrmw.
9457                                                           - Ensures any
9458                                                             following global
9459                                                             data read is no
9460                                                             older than the
9461                                                             value read by the
9462                                                             fence-paired-atomic.
9463
9464                                                         2. buffer_inv sc0=1
9465
9466                                                           - If not TgSplit execution
9467                                                             mode, omit.
9468                                                           - Ensures that
9469                                                             following
9470                                                             loads will not see
9471                                                             stale data.
9472
9473     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9474                                                            vmcnt(0)
9475
9476                                                           - If TgSplit execution mode,
9477                                                             omit lgkmcnt(0).
9478                                                           - If OpenCL and
9479                                                             address space is
9480                                                             not generic, omit
9481                                                             lgkmcnt(0).
9482                                                           - However, since LLVM
9483                                                             currently has no
9484                                                             address space on
9485                                                             the fence, the wait
9486                                                             must conservatively
9487                                                             always be generated
9488                                                             (see comment for
9489                                                             previous fence).
9490                                                           - Could be split into
9491                                                             separate s_waitcnt
9492                                                             vmcnt(0) and
9493                                                             s_waitcnt
9494                                                             lgkmcnt(0) to allow
9495                                                             them to be
9496                                                             independently moved
9497                                                             according to the
9498                                                             following rules.
9499                                                           - s_waitcnt vmcnt(0)
9500                                                             must happen after
9501                                                             any preceding
9502                                                             global/generic load
9503                                                             atomic/atomicrmw
9504                                                             with an equal or
9505                                                             wider sync scope
9506                                                             and memory ordering
9507                                                             stronger than
9508                                                             unordered (this is
9509                                                             termed the
9510                                                             fence-paired-atomic).
9511                                                           - s_waitcnt lgkmcnt(0)
9512                                                             must happen after
9513                                                             any preceding
9514                                                             local/generic load
9515                                                             atomic/atomicrmw
9516                                                             with an equal or
9517                                                             wider sync scope
9518                                                             and memory ordering
9519                                                             stronger than
9520                                                             unordered (this is
9521                                                             termed the
9522                                                             fence-paired-atomic).
9523                                                           - Must happen before
9524                                                             the following
9525                                                             buffer_inv.
9526                                                           - Ensures that the
9527                                                             fence-paired-atomic
9528                                                             has completed
9529                                                             before invalidating
9530                                                             the
9531                                                             cache. Therefore
9532                                                             any following
9533                                                             locations read must
9534                                                             be no older than
9535                                                             the value read by
9536                                                             the
9537                                                             fence-paired-atomic.
9538
9539                                                         2. buffer_inv sc1=1
9540
9541                                                           - Must happen before any
9542                                                             following global/generic
9543                                                             load/load
9544                                                             atomic/store/store
9545                                                             atomic/atomicrmw.
9546                                                           - Ensures that
9547                                                             following loads
9548                                                             will not see stale
9549                                                             global data.
9550
9551     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
9552                                                            vmcnt(0)
9553
9554                                                           - If TgSplit execution mode,
9555                                                             omit lgkmcnt(0).
9556                                                           - If OpenCL and
9557                                                             address space is
9558                                                             not generic, omit
9559                                                             lgkmcnt(0).
9560                                                           - However, since LLVM
9561                                                             currently has no
9562                                                             address space on
9563                                                             the fence, the wait
9564                                                             must conservatively
9565                                                             always be generated
9566                                                             (see comment for
9567                                                             previous fence).
9568                                                           - Could be split into
9569                                                             separate s_waitcnt
9570                                                             vmcnt(0) and
9571                                                             s_waitcnt
9572                                                             lgkmcnt(0) to allow
9573                                                             them to be
9574                                                             independently moved
9575                                                             according to the
9576                                                             following rules.
9577                                                           - s_waitcnt vmcnt(0)
9578                                                             must happen after
9579                                                             any preceding
9580                                                             global/generic load
9581                                                             atomic/atomicrmw
9582                                                             with an equal or
9583                                                             wider sync scope
9584                                                             and memory ordering
9585                                                             stronger than
9586                                                             unordered (this is
9587                                                             termed the
9588                                                             fence-paired-atomic).
9589                                                           - s_waitcnt lgkmcnt(0)
9590                                                             must happen after
9591                                                             any preceding
9592                                                             local/generic load
9593                                                             atomic/atomicrmw
9594                                                             with an equal or
9595                                                             wider sync scope
9596                                                             and memory ordering
9597                                                             stronger than
9598                                                             unordered (this is
9599                                                             termed the
9600                                                             fence-paired-atomic).
9601                                                           - Must happen before
9602                                                             the following
9603                                                             buffer_inv.
9604                                                           - Ensures that the
9605                                                             fence-paired-atomic
9606                                                             has completed
9607                                                             before invalidating
9608                                                             the
9609                                                             cache. Therefore
9610                                                             any following
9611                                                             locations read must
9612                                                             be no older than
9613                                                             the value read by
9614                                                             the
9615                                                             fence-paired-atomic.
9616
9617                                                         2. buffer_inv sc0=1 sc1=1
9618
9619                                                           - Must happen before any
9620                                                             following global/generic
9621                                                             load/load
9622                                                             atomic/store/store
9623                                                             atomic/atomicrmw.
9624                                                           - Ensures that
9625                                                             following loads
9626                                                             will not see stale
9627                                                             global data.
9628
9629     **Release Atomic**
9630     ------------------------------------------------------------------------------------
9631     store atomic release      - singlethread - global   1. buffer/global/flat_store
9632                               - wavefront    - generic
9633     store atomic release      - singlethread - local    *If TgSplit execution mode,
9634                               - wavefront               local address space cannot
9635                                                         be used.*
9636
9637                                                         1. ds_store
9638     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9639                                              - generic
9640                                                           - Use lgkmcnt(0) if not
9641                                                             TgSplit execution mode
9642                                                             and vmcnt(0) if TgSplit
9643                                                             execution mode.
9644                                                           - If OpenCL, omit lgkmcnt(0).
9645                                                           - s_waitcnt vmcnt(0)
9646                                                             must happen after
9647                                                             any preceding
9648                                                             global/generic load/store/
9649                                                             load atomic/store atomic/
9650                                                             atomicrmw.
9651                                                           - s_waitcnt lgkmcnt(0)
9652                                                             must happen after
9653                                                             any preceding
9654                                                             local/generic
9655                                                             load/store/load
9656                                                             atomic/store
9657                                                             atomic/atomicrmw.
9658                                                           - Must happen before
9659                                                             the following
9660                                                             store.
9661                                                           - Ensures that all
9662                                                             memory operations
9663                                                             have
9664                                                             completed before
9665                                                             performing the
9666                                                             store that is being
9667                                                             released.
9668
9669                                                         2. buffer/global/flat_store sc0=1
9670     store atomic release      - workgroup    - local    *If TgSplit execution mode,
9671                                                         local address space cannot
9672                                                         be used.*
9673
9674                                                         1. ds_store
9675     store atomic release      - agent        - global   1. buffer_wbl2 sc1=1
9676                                              - generic
9677                                                           - Must happen before
9678                                                             following s_waitcnt.
9679                                                           - Performs L2 writeback to
9680                                                             ensure previous
9681                                                             global/generic
9682                                                             store/atomicrmw are
9683                                                             visible at agent scope.
9684
9685                                                         2. s_waitcnt lgkmcnt(0) &
9686                                                            vmcnt(0)
9687
9688                                                           - If TgSplit execution mode,
9689                                                             omit lgkmcnt(0).
9690                                                           - If OpenCL and
9691                                                             address space is
9692                                                             not generic, omit
9693                                                             lgkmcnt(0).
9694                                                           - Could be split into
9695                                                             separate s_waitcnt
9696                                                             vmcnt(0) and
9697                                                             s_waitcnt
9698                                                             lgkmcnt(0) to allow
9699                                                             them to be
9700                                                             independently moved
9701                                                             according to the
9702                                                             following rules.
9703                                                           - s_waitcnt vmcnt(0)
9704                                                             must happen after
9705                                                             any preceding
9706                                                             global/generic
9707                                                             load/store/load
9708                                                             atomic/store
9709                                                             atomic/atomicrmw.
9710                                                           - s_waitcnt lgkmcnt(0)
9711                                                             must happen after
9712                                                             any preceding
9713                                                             local/generic
9714                                                             load/store/load
9715                                                             atomic/store
9716                                                             atomic/atomicrmw.
9717                                                           - Must happen before
9718                                                             the following
9719                                                             store.
9720                                                           - Ensures that all
9721                                                             memory operations
9722                                                             have
9723                                                             completed before
9724                                                             performing the
9725                                                             store that is being
9726                                                             released.
9727
9728                                                         3. buffer/global/flat_store sc1=1
9729     store atomic release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9730                                              - generic
9731                                                           - Must happen before
9732                                                             following s_waitcnt.
9733                                                           - Performs L2 writeback to
9734                                                             ensure previous
9735                                                             global/generic
9736                                                             store/atomicrmw are
9737                                                             visible at system scope.
9738
9739                                                         2. s_waitcnt lgkmcnt(0) &
9740                                                            vmcnt(0)
9741
9742                                                           - If TgSplit execution mode,
9743                                                             omit lgkmcnt(0).
9744                                                           - If OpenCL and
9745                                                             address space is
9746                                                             not generic, omit
9747                                                             lgkmcnt(0).
9748                                                           - Could be split into
9749                                                             separate s_waitcnt
9750                                                             vmcnt(0) and
9751                                                             s_waitcnt
9752                                                             lgkmcnt(0) to allow
9753                                                             them to be
9754                                                             independently moved
9755                                                             according to the
9756                                                             following rules.
9757                                                           - s_waitcnt vmcnt(0)
9758                                                             must happen after any
9759                                                             preceding
9760                                                             global/generic
9761                                                             load/store/load
9762                                                             atomic/store
9763                                                             atomic/atomicrmw.
9764                                                           - s_waitcnt lgkmcnt(0)
9765                                                             must happen after any
9766                                                             preceding
9767                                                             local/generic
9768                                                             load/store/load
9769                                                             atomic/store
9770                                                             atomic/atomicrmw.
9771                                                           - Must happen before
9772                                                             the following
9773                                                             store.
9774                                                           - Ensures that all
9775                                                             memory operations
9776                                                             and the L2
9777                                                             writeback have
9778                                                             completed before
9779                                                             performing the
9780                                                             store that is being
9781                                                             released.
9782
9783                                                         3. buffer/global/flat_store
9784                                                            sc0=1 sc1=1
9785     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
9786                               - wavefront    - generic
9787     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
9788                               - wavefront               local address space cannot
9789                                                         be used.*
9790
9791                                                         1. ds_atomic
9792     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9793                                              - generic
9794                                                           - Use lgkmcnt(0) if not
9795                                                             TgSplit execution mode
9796                                                             and vmcnt(0) if TgSplit
9797                                                             execution mode.
9798                                                           - If OpenCL, omit
9799                                                             lgkmcnt(0).
9800                                                           - s_waitcnt vmcnt(0)
9801                                                             must happen after
9802                                                             any preceding
9803                                                             global/generic load/store/
9804                                                             load atomic/store atomic/
9805                                                             atomicrmw.
9806                                                           - s_waitcnt lgkmcnt(0)
9807                                                             must happen after
9808                                                             any preceding
9809                                                             local/generic
9810                                                             load/store/load
9811                                                             atomic/store
9812                                                             atomic/atomicrmw.
9813                                                           - Must happen before
9814                                                             the following
9815                                                             atomicrmw.
9816                                                           - Ensures that all
9817                                                             memory operations
9818                                                             have
9819                                                             completed before
9820                                                             performing the
9821                                                             atomicrmw that is
9822                                                             being released.
9823
9824                                                         2. buffer/global/flat_atomic sc0=1
9825     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
9826                                                         local address space cannot
9827                                                         be used.*
9828
9829                                                         1. ds_atomic
9830     atomicrmw    release      - agent        - global   1. buffer_wbl2 sc1=1
9831                                              - generic
9832                                                           - Must happen before
9833                                                             following s_waitcnt.
9834                                                           - Performs L2 writeback to
9835                                                             ensure previous
9836                                                             global/generic
9837                                                             store/atomicrmw are
9838                                                             visible at agent scope.
9839
9840                                                         2. s_waitcnt lgkmcnt(0) &
9841                                                            vmcnt(0)
9842
9843                                                           - If TgSplit execution mode,
9844                                                             omit lgkmcnt(0).
9845                                                           - If OpenCL, omit
9846                                                             lgkmcnt(0).
9847                                                           - Could be split into
9848                                                             separate s_waitcnt
9849                                                             vmcnt(0) and
9850                                                             s_waitcnt
9851                                                             lgkmcnt(0) to allow
9852                                                             them to be
9853                                                             independently moved
9854                                                             according to the
9855                                                             following rules.
9856                                                           - s_waitcnt vmcnt(0)
9857                                                             must happen after
9858                                                             any preceding
9859                                                             global/generic
9860                                                             load/store/load
9861                                                             atomic/store
9862                                                             atomic/atomicrmw.
9863                                                           - s_waitcnt lgkmcnt(0)
9864                                                             must happen after
9865                                                             any preceding
9866                                                             local/generic
9867                                                             load/store/load
9868                                                             atomic/store
9869                                                             atomic/atomicrmw.
9870                                                           - Must happen before
9871                                                             the following
9872                                                             atomicrmw.
9873                                                           - Ensures that all
9874                                                             memory operations
9875                                                             to global and local
9876                                                             have completed
9877                                                             before performing
9878                                                             the atomicrmw that
9879                                                             is being released.
9880
9881                                                         3. buffer/global/flat_atomic sc1=1
9882     atomicrmw    release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9883                                              - generic
9884                                                           - Must happen before
9885                                                             following s_waitcnt.
9886                                                           - Performs L2 writeback to
9887                                                             ensure previous
9888                                                             global/generic
9889                                                             store/atomicrmw are
9890                                                             visible at system scope.
9891
9892                                                         2. s_waitcnt lgkmcnt(0) &
9893                                                            vmcnt(0)
9894
9895                                                           - If TgSplit execution mode,
9896                                                             omit lgkmcnt(0).
9897                                                           - If OpenCL, omit
9898                                                             lgkmcnt(0).
9899                                                           - Could be split into
9900                                                             separate s_waitcnt
9901                                                             vmcnt(0) and
9902                                                             s_waitcnt
9903                                                             lgkmcnt(0) to allow
9904                                                             them to be
9905                                                             independently moved
9906                                                             according to the
9907                                                             following rules.
9908                                                           - s_waitcnt vmcnt(0)
9909                                                             must happen after
9910                                                             any preceding
9911                                                             global/generic
9912                                                             load/store/load
9913                                                             atomic/store
9914                                                             atomic/atomicrmw.
9915                                                           - s_waitcnt lgkmcnt(0)
9916                                                             must happen after
9917                                                             any preceding
9918                                                             local/generic
9919                                                             load/store/load
9920                                                             atomic/store
9921                                                             atomic/atomicrmw.
9922                                                           - Must happen before
9923                                                             the following
9924                                                             atomicrmw.
9925                                                           - Ensures that all
9926                                                             memory operations
9927                                                             and the L2
9928                                                             writeback have
9929                                                             completed before
9930                                                             performing the
9931                                                             atomicrmw that is
9932                                                             being released.
9933
9934                                                         3. buffer/global/flat_atomic
9935                                                            sc0=1 sc1=1
9936     fence        release      - singlethread *none*     *none*
9937                               - wavefront
9938     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9939
9940                                                           - Use lgkmcnt(0) if not
9941                                                             TgSplit execution mode
9942                                                             and vmcnt(0) if TgSplit
9943                                                             execution mode.
9944                                                           - If OpenCL and
9945                                                             address space is
9946                                                             not generic, omit
9947                                                             lgkmcnt(0).
9948                                                           - If OpenCL and
9949                                                             address space is
9950                                                             local, omit
9951                                                             vmcnt(0).
9952                                                           - However, since LLVM
9953                                                             currently has no
9954                                                             address space on
9955                                                             the fence, the wait
9956                                                             must conservatively
9957                                                             always be generated.
9958                                                             If the fence had an
9959                                                             address space, it
9960                                                             would be set to the
9961                                                             address space of the
9962                                                             OpenCL fence flag,
9963                                                             or to generic if
9964                                                             both local and
9965                                                             global flags are
9966                                                             specified.
9967                                                           - s_waitcnt vmcnt(0)
9968                                                             must happen after
9969                                                             any preceding
9970                                                             global/generic
9971                                                             load/store/
9972                                                             load atomic/store atomic/
9973                                                             atomicrmw.
9974                                                           - s_waitcnt lgkmcnt(0)
9975                                                             must happen after
9976                                                             any preceding
9977                                                             local/generic
9978                                                             load/load
9979                                                             atomic/store/store
9980                                                             atomic/atomicrmw.
9981                                                           - Must happen before
9982                                                             any following store
9983                                                             atomic/atomicrmw
9984                                                             with an equal or
9985                                                             wider sync scope
9986                                                             and memory ordering
9987                                                             stronger than
9988                                                             unordered (this is
9989                                                             termed the
9990                                                             fence-paired-atomic).
9991                                                           - Ensures that all
9992                                                             memory operations
9993                                                             have
9994                                                             completed before
9995                                                             performing the
9996                                                             following
9997                                                             fence-paired-atomic.
9998
9999     fence        release      - agent        *none*     1. buffer_wbl2 sc1=1
10000
10001                                                           - If OpenCL and
10002                                                             address space is
10003                                                             local, omit.
10004                                                           - Must happen before
10005                                                             following s_waitcnt.
10006                                                           - Performs L2 writeback to
10007                                                             ensure previous
10008                                                             global/generic
10009                                                             store/atomicrmw are
10010                                                             visible at agent scope.
10011
10012                                                         2. s_waitcnt lgkmcnt(0) &
10013                                                            vmcnt(0)
10014
10015                                                           - If TgSplit execution mode,
10016                                                             omit lgkmcnt(0).
10017                                                           - If OpenCL and
10018                                                             address space is
10019                                                             not generic, omit
10020                                                             lgkmcnt(0).
10021                                                           - If OpenCL and
10022                                                             address space is
10023                                                             local, omit
10024                                                             vmcnt(0).
10025                                                           - However, since LLVM
10026                                                             currently has no
10027                                                             address space on
10028                                                             the fence, the
10029                                                             s_waitcnt must
10030                                                             conservatively
10031                                                             always be generated.
10032                                                             If the fence had an
10033                                                             address space, it
10034                                                             would be set to the
10035                                                             address space of the
10036                                                             OpenCL fence flag, or
10037                                                             to generic if both
10038                                                             local and global
10039                                                             flags are specified.
10040                                                           - Could be split into
10041                                                             separate s_waitcnt
10042                                                             vmcnt(0) and
10043                                                             s_waitcnt
10044                                                             lgkmcnt(0) to allow
10045                                                             them to be
10046                                                             independently moved
10047                                                             according to the
10048                                                             following rules.
10049                                                           - s_waitcnt vmcnt(0)
10050                                                             must happen after
10051                                                             any preceding
10052                                                             global/generic
10053                                                             load/store/load
10054                                                             atomic/store
10055                                                             atomic/atomicrmw.
10056                                                           - s_waitcnt lgkmcnt(0)
10057                                                             must happen after
10058                                                             any preceding
10059                                                             local/generic
10060                                                             load/store/load
10061                                                             atomic/store
10062                                                             atomic/atomicrmw.
10063                                                           - Must happen before
10064                                                             any following store
10065                                                             atomic/atomicrmw
10066                                                             with an equal or
10067                                                             wider sync scope
10068                                                             and memory ordering
10069                                                             stronger than
10070                                                             unordered (this is
10071                                                             termed the
10072                                                             fence-paired-atomic).
10073                                                           - Ensures that all
10074                                                             memory operations
10075                                                             have
10076                                                             completed before
10077                                                             performing the
10078                                                             following
10079                                                             fence-paired-atomic.
10080
10081     fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10082
10083                                                           - Must happen before
10084                                                             following s_waitcnt.
10085                                                           - Performs L2 writeback to
10086                                                             ensure previous
10087                                                             global/generic
10088                                                             store/atomicrmw are
10089                                                             visible at system scope.
10090
10091                                                         2. s_waitcnt lgkmcnt(0) &
10092                                                            vmcnt(0)
10093
10094                                                           - If TgSplit execution mode,
10095                                                             omit lgkmcnt(0).
10096                                                           - If OpenCL and
10097                                                             address space is
10098                                                             not generic, omit
10099                                                             lgkmcnt(0).
10100                                                           - If OpenCL and
10101                                                             address space is
10102                                                             local, omit
10103                                                             vmcnt(0).
10104                                                           - However, since LLVM
10105                                                             currently has no
10106                                                             address space on
10107                                                             the fence, the
10108                                                             s_waitcnt must
10109                                                             conservatively
10110                                                             always be generated.
10111                                                             If the fence had an
10112                                                             address space, it
10113                                                             would be set to the
10114                                                             address space of the
10115                                                             OpenCL fence flag, or
10116                                                             to generic if both
10117                                                             local and global
10118                                                             flags are specified.
10119                                                           - Could be split into
10120                                                             separate s_waitcnt
10121                                                             vmcnt(0) and
10122                                                             s_waitcnt
10123                                                             lgkmcnt(0) to allow
10124                                                             them to be
10125                                                             independently moved
10126                                                             according to the
10127                                                             following rules.
10128                                                           - s_waitcnt vmcnt(0)
10129                                                             must happen after
10130                                                             any preceding
10131                                                             global/generic
10132                                                             load/store/load
10133                                                             atomic/store
10134                                                             atomic/atomicrmw.
10135                                                           - s_waitcnt lgkmcnt(0)
10136                                                             must happen after
10137                                                             any preceding
10138                                                             local/generic
10139                                                             load/store/load
10140                                                             atomic/store
10141                                                             atomic/atomicrmw.
10142                                                           - Must happen before
10143                                                             any following store
10144                                                             atomic/atomicrmw
10145                                                             with an equal or
10146                                                             wider sync scope
10147                                                             and memory ordering
10148                                                             stronger than
10149                                                             unordered (this is
10150                                                             termed the
10151                                                             fence-paired-atomic).
10152                                                           - Ensures that all
10153                                                             memory operations
10154                                                             have
10155                                                             completed before
10156                                                             performing the
10157                                                             following
10158                                                             fence-paired-atomic.
10159
10160     **Acquire-Release Atomic**
10161     ------------------------------------------------------------------------------------
10162     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
10163                               - wavefront    - generic
10164     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
10165                               - wavefront               local address space cannot
10166                                                         be used.*
10167
10168                                                         1. ds_atomic
10169     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10170
10171                                                           - Use lgkmcnt(0) if not
10172                                                             TgSplit execution mode
10173                                                             and vmcnt(0) if TgSplit
10174                                                             execution mode.
10175                                                           - If OpenCL, omit
10176                                                             lgkmcnt(0).
10177                                                           - Must happen after
10178                                                             any preceding
10179                                                             local/generic
10180                                                             load/store/load
10181                                                             atomic/store
10182                                                             atomic/atomicrmw.
10183                                                           - s_waitcnt vmcnt(0)
10184                                                             must happen after
10185                                                             any preceding
10186                                                             global/generic load/store/
10187                                                             load atomic/store atomic/
10188                                                             atomicrmw.
10189                                                           - s_waitcnt lgkmcnt(0)
10190                                                             must happen after
10191                                                             any preceding
10192                                                             local/generic
10193                                                             load/store/load
10194                                                             atomic/store
10195                                                             atomic/atomicrmw.
10196                                                           - Must happen before
10197                                                             the following
10198                                                             atomicrmw.
10199                                                           - Ensures that all
10200                                                             memory operations
10201                                                             have
10202                                                             completed before
10203                                                             performing the
10204                                                             atomicrmw that is
10205                                                             being released.
10206
10207                                                         2. buffer/global_atomic
10208                                                         3. s_waitcnt vmcnt(0)
10209
10210                                                           - If not TgSplit execution
10211                                                             mode, omit.
10212                                                           - Must happen before
10213                                                             the following
10214                                                             buffer_inv.
10215                                                           - Ensures any
10216                                                             following global
10217                                                             data read is no
10218                                                             older than the
10219                                                             atomicrmw value
10220                                                             being acquired.
10221
10222                                                         4. buffer_inv sc0=1
10223
10224                                                           - If not TgSplit execution
10225                                                             mode, omit.
10226                                                           - Ensures that
10227                                                             following
10228                                                             loads will not see
10229                                                             stale data.
10230
10231     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
10232                                                         local address space cannot
10233                                                         be used.*
10234
10235                                                         1. ds_atomic
10236                                                         2. s_waitcnt lgkmcnt(0)
10237
10238                                                           - If OpenCL, omit.
10239                                                           - Must happen before
10240                                                             any following
10241                                                             global/generic
10242                                                             load/load
10243                                                             atomic/store/store
10244                                                             atomic/atomicrmw.
10245                                                           - Ensures any
10246                                                             following global
10247                                                             data read is no
10248                                                             older than the local
10249                                                             atomicrmw value being
10250                                                             acquired.
10251
10252     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
10253
10254                                                           - Use lgkmcnt(0) if not
10255                                                             TgSplit execution mode
10256                                                             and vmcnt(0) if TgSplit
10257                                                             execution mode.
10258                                                           - If OpenCL, omit
10259                                                             lgkmcnt(0).
10260                                                           - s_waitcnt vmcnt(0)
10261                                                             must happen after
10262                                                             any preceding
10263                                                             global/generic load/store/
10264                                                             load atomic/store atomic/
10265                                                             atomicrmw.
10266                                                           - s_waitcnt lgkmcnt(0)
10267                                                             must happen after
10268                                                             any preceding
10269                                                             local/generic
10270                                                             load/store/load
10271                                                             atomic/store
10272                                                             atomic/atomicrmw.
10273                                                           - Must happen before
10274                                                             the following
10275                                                             atomicrmw.
10276                                                           - Ensures that all
10277                                                             memory operations
10278                                                             have
10279                                                             completed before
10280                                                             performing the
10281                                                             atomicrmw that is
10282                                                             being released.
10283
10284                                                         2. flat_atomic
10285                                                         3. s_waitcnt lgkmcnt(0) &
10286                                                            vmcnt(0)
10287
10288                                                           - If not TgSplit execution
10289                                                             mode, omit vmcnt(0).
10290                                                           - If OpenCL, omit
10291                                                             lgkmcnt(0).
10292                                                           - Must happen before
10293                                                             the following
10294                                                             buffer_inv and
10295                                                             any following
10296                                                             global/generic
10297                                                             load/load
10298                                                             atomic/store/store
10299                                                             atomic/atomicrmw.
10300                                                           - Ensures any
10301                                                             following global
10302                                                             data read is no
10303                                                             older than a local
10304                                                             atomicrmw value being
10305                                                             acquired.
10306
10307                                                         4. buffer_inv sc0=1
10308
10309                                                           - If not TgSplit execution
10310                                                             mode, omit.
10311                                                           - Ensures that
10312                                                             following
10313                                                             loads will not see
10314                                                             stale data.
10315
10316     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1
10317
10318                                                           - Must happen before
10319                                                             following s_waitcnt.
10320                                                           - Performs L2 writeback to
10321                                                             ensure previous
10322                                                             global/generic
10323                                                             store/atomicrmw are
10324                                                             visible at agent scope.
10325
10326                                                         2. s_waitcnt lgkmcnt(0) &
10327                                                            vmcnt(0)
10328
10329                                                           - If TgSplit execution mode,
10330                                                             omit lgkmcnt(0).
10331                                                           - If OpenCL, omit
10332                                                             lgkmcnt(0).
10333                                                           - Could be split into
10334                                                             separate s_waitcnt
10335                                                             vmcnt(0) and
10336                                                             s_waitcnt
10337                                                             lgkmcnt(0) to allow
10338                                                             them to be
10339                                                             independently moved
10340                                                             according to the
10341                                                             following rules.
10342                                                           - s_waitcnt vmcnt(0)
10343                                                             must happen after
10344                                                             any preceding
10345                                                             global/generic
10346                                                             load/store/load
10347                                                             atomic/store
10348                                                             atomic/atomicrmw.
10349                                                           - s_waitcnt lgkmcnt(0)
10350                                                             must happen after
10351                                                             any preceding
10352                                                             local/generic
10353                                                             load/store/load
10354                                                             atomic/store
10355                                                             atomic/atomicrmw.
10356                                                           - Must happen before
10357                                                             the following
10358                                                             atomicrmw.
10359                                                           - Ensures that all
10360                                                             memory operations
10361                                                             to global have
10362                                                             completed before
10363                                                             performing the
10364                                                             atomicrmw that is
10365                                                             being released.
10366
10367                                                         3. buffer/global_atomic
10368                                                         4. s_waitcnt vmcnt(0)
10369
10370                                                           - Must happen before
10371                                                             following
10372                                                             buffer_inv.
10373                                                           - Ensures the
10374                                                             atomicrmw has
10375                                                             completed before
10376                                                             invalidating the
10377                                                             cache.
10378
10379                                                         5. buffer_inv sc1=1
10380
10381                                                           - Must happen before
10382                                                             any following
10383                                                             global/generic
10384                                                             load/load
10385                                                             atomic/atomicrmw.
10386                                                           - Ensures that
10387                                                             following loads
10388                                                             will not see stale
10389                                                             global data.
10390
10391     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
10392
10393                                                           - Must happen before
10394                                                             following s_waitcnt.
10395                                                           - Performs L2 writeback to
10396                                                             ensure previous
10397                                                             global/generic
10398                                                             store/atomicrmw are
10399                                                             visible at system scope.
10400
10401                                                         2. s_waitcnt lgkmcnt(0) &
10402                                                            vmcnt(0)
10403
10404                                                           - If TgSplit execution mode,
10405                                                             omit lgkmcnt(0).
10406                                                           - If OpenCL, omit
10407                                                             lgkmcnt(0).
10408                                                           - Could be split into
10409                                                             separate s_waitcnt
10410                                                             vmcnt(0) and
10411                                                             s_waitcnt
10412                                                             lgkmcnt(0) to allow
10413                                                             them to be
10414                                                             independently moved
10415                                                             according to the
10416                                                             following rules.
10417                                                           - s_waitcnt vmcnt(0)
10418                                                             must happen after
10419                                                             any preceding
10420                                                             global/generic
10421                                                             load/store/load
10422                                                             atomic/store
10423                                                             atomic/atomicrmw.
10424                                                           - s_waitcnt lgkmcnt(0)
10425                                                             must happen after
10426                                                             any preceding
10427                                                             local/generic
10428                                                             load/store/load
10429                                                             atomic/store
10430                                                             atomic/atomicrmw.
10431                                                           - Must happen before
10432                                                             the following
10433                                                             atomicrmw.
10434                                                           - Ensures that all
10435                                                             memory operations
10436                                                             to global and L2 writeback
10437                                                             have completed before
10438                                                             performing the
10439                                                             atomicrmw that is
10440                                                             being released.
10441
10442                                                         3. buffer/global_atomic
10443                                                            sc1=1
10444                                                         4. s_waitcnt vmcnt(0)
10445
10446                                                           - Must happen before
10447                                                             following
10448                                                             buffer_inv.
10449                                                           - Ensures the
10450                                                             atomicrmw has
10451                                                             completed before
10452                                                             invalidating the
10453                                                             caches.
10454
10455                                                         5. buffer_inv sc0=1 sc1=1
10456
10457                                                           - Must happen before
10458                                                             any following
10459                                                             global/generic
10460                                                             load/load
10461                                                             atomic/atomicrmw.
10462                                                           - Ensures that
10463                                                             following loads
10464                                                             will not see stale
10465                                                             MTYPE NC global data.
10466                                                             MTYPE RW and CC memory will
10467                                                             never be stale due to the
10468                                                             memory probes.
10469
10470     atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1
10471
10472                                                           - Must happen before
10473                                                             following s_waitcnt.
10474                                                           - Performs L2 writeback to
10475                                                             ensure previous
10476                                                             global/generic
10477                                                             store/atomicrmw are
10478                                                             visible at agent scope.
10479
10480                                                         2. s_waitcnt lgkmcnt(0) &
10481                                                            vmcnt(0)
10482
10483                                                           - If TgSplit execution mode,
10484                                                             omit lgkmcnt(0).
10485                                                           - If OpenCL, omit
10486                                                             lgkmcnt(0).
10487                                                           - Could be split into
10488                                                             separate s_waitcnt
10489                                                             vmcnt(0) and
10490                                                             s_waitcnt
10491                                                             lgkmcnt(0) to allow
10492                                                             them to be
10493                                                             independently moved
10494                                                             according to the
10495                                                             following rules.
10496                                                           - s_waitcnt vmcnt(0)
10497                                                             must happen after
10498                                                             any preceding
10499                                                             global/generic
10500                                                             load/store/load
10501                                                             atomic/store
10502                                                             atomic/atomicrmw.
10503                                                           - s_waitcnt lgkmcnt(0)
10504                                                             must happen after
10505                                                             any preceding
10506                                                             local/generic
10507                                                             load/store/load
10508                                                             atomic/store
10509                                                             atomic/atomicrmw.
10510                                                           - Must happen before
10511                                                             the following
10512                                                             atomicrmw.
10513                                                           - Ensures that all
10514                                                             memory operations
10515                                                             to global have
10516                                                             completed before
10517                                                             performing the
10518                                                             atomicrmw that is
10519                                                             being released.
10520
10521                                                         3. flat_atomic
10522                                                         4. s_waitcnt vmcnt(0) &
10523                                                            lgkmcnt(0)
10524
10525                                                           - If TgSplit execution mode,
10526                                                             omit lgkmcnt(0).
10527                                                           - If OpenCL, omit
10528                                                             lgkmcnt(0).
10529                                                           - Must happen before
10530                                                             following
10531                                                             buffer_inv.
10532                                                           - Ensures the
10533                                                             atomicrmw has
10534                                                             completed before
10535                                                             invalidating the
10536                                                             cache.
10537
10538                                                         5. buffer_inv sc1=1
10539
10540                                                           - Must happen before
10541                                                             any following
10542                                                             global/generic
10543                                                             load/load
10544                                                             atomic/atomicrmw.
10545                                                           - Ensures that
10546                                                             following loads
10547                                                             will not see stale
10548                                                             global data.
10549
10550     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1
10551
10552                                                           - Must happen before
10553                                                             following s_waitcnt.
10554                                                           - Performs L2 writeback to
10555                                                             ensure previous
10556                                                             global/generic
10557                                                             store/atomicrmw are
10558                                                             visible at system scope.
10559
10560                                                         2. s_waitcnt lgkmcnt(0) &
10561                                                            vmcnt(0)
10562
10563                                                           - If TgSplit execution mode,
10564                                                             omit lgkmcnt(0).
10565                                                           - If OpenCL, omit
10566                                                             lgkmcnt(0).
10567                                                           - Could be split into
10568                                                             separate s_waitcnt
10569                                                             vmcnt(0) and
10570                                                             s_waitcnt
10571                                                             lgkmcnt(0) to allow
10572                                                             them to be
10573                                                             independently moved
10574                                                             according to the
10575                                                             following rules.
10576                                                           - s_waitcnt vmcnt(0)
10577                                                             must happen after
10578                                                             any preceding
10579                                                             global/generic
10580                                                             load/store/load
10581                                                             atomic/store
10582                                                             atomic/atomicrmw.
10583                                                           - s_waitcnt lgkmcnt(0)
10584                                                             must happen after
10585                                                             any preceding
10586                                                             local/generic
10587                                                             load/store/load
10588                                                             atomic/store
10589                                                             atomic/atomicrmw.
10590                                                           - Must happen before
10591                                                             the following
10592                                                             atomicrmw.
10593                                                           - Ensures that all
10594                                                             memory operations
10595                                                             to global and L2 writeback
10596                                                             have completed before
10597                                                             performing the
10598                                                             atomicrmw that is
10599                                                             being released.
10600
10601                                                         3. flat_atomic sc1=1
10602                                                         4. s_waitcnt vmcnt(0) &
10603                                                            lgkmcnt(0)
10604
10605                                                           - If TgSplit execution mode,
10606                                                             omit lgkmcnt(0).
10607                                                           - If OpenCL, omit
10608                                                             lgkmcnt(0).
10609                                                           - Must happen before
10610                                                             following
10611                                                             buffer_inv.
10612                                                           - Ensures the
10613                                                             atomicrmw has
10614                                                             completed before
10615                                                             invalidating the
10616                                                             caches.
10617
10618                                                         5. buffer_inv sc0=1 sc1=1
10619
10620                                                           - Must happen before
10621                                                             any following
10622                                                             global/generic
10623                                                             load/load
10624                                                             atomic/atomicrmw.
10625                                                           - Ensures that
10626                                                             following loads
10627                                                             will not see stale
10628                                                             MTYPE NC global data.
10629                                                             MTYPE RW and CC memory will
10630                                                             never be stale due to the
10631                                                             memory probes.
10632
10633     fence        acq_rel      - singlethread *none*     *none*
10634                               - wavefront
10635     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
10636
10637                                                           - Use lgkmcnt(0) if not
10638                                                             TgSplit execution mode
10639                                                             and vmcnt(0) if TgSplit
10640                                                             execution mode.
10641                                                           - If OpenCL and
10642                                                             address space is
10643                                                             not generic, omit
10644                                                             lgkmcnt(0).
10645                                                           - If OpenCL and
10646                                                             address space is
10647                                                             local, omit
10648                                                             vmcnt(0).
10649                                                           - However,
10650                                                             since LLVM
10651                                                             currently has no
10652                                                             address space on
10653                                                             the fence, the compiler
10654                                                             must conservatively
10655                                                             always generate it
10656                                                             (see comment for
10657                                                             previous fence).
10658                                                           - s_waitcnt vmcnt(0)
10659                                                             must happen after
10660                                                             any preceding
10661                                                             global/generic
10662                                                             load/store/
10663                                                             load atomic/store atomic/
10664                                                             atomicrmw.
10665                                                           - s_waitcnt lgkmcnt(0)
10666                                                             must happen after
10667                                                             any preceding
10668                                                             local/generic
10669                                                             load/load
10670                                                             atomic/store/store
10671                                                             atomic/atomicrmw.
10672                                                           - Must happen before
10673                                                             any following
10674                                                             global/generic
10675                                                             load/load
10676                                                             atomic/store/store
10677                                                             atomic/atomicrmw.
10678                                                           - Ensures that all
10679                                                             memory operations
10680                                                             have
10681                                                             completed before
10682                                                             performing any
10683                                                             following global
10684                                                             memory operations.
10685                                                           - Ensures that the
10686                                                             preceding
10687                                                             local/generic load
10688                                                             atomic/atomicrmw
10689                                                             with an equal or
10690                                                             wider sync scope
10691                                                             and memory ordering
10692                                                             stronger than
10693                                                             unordered (this is
10694                                                             termed the
10695                                                             acquire-fence-paired-atomic)
10696                                                             has completed
10697                                                             before following
10698                                                             global memory
10699                                                             operations. This
10700                                                             satisfies the
10701                                                             requirements of
10702                                                             acquire.
10703                                                           - Ensures that all
10704                                                             previous memory
10705                                                             operations have
10706                                                             completed before a
10707                                                             following
10708                                                             local/generic store
10709                                                             atomic/atomicrmw
10710                                                             with an equal or
10711                                                             wider sync scope
10712                                                             and memory ordering
10713                                                             stronger than
10714                                                             unordered (this is
10715                                                             termed the
10716                                                             release-fence-paired-atomic).
10717                                                             This satisfies the
10718                                                             requirements of
10719                                                             release.
10720                                                           - Must happen before
10721                                                             the following
10722                                                             buffer_inv.
10723                                                           - Ensures that the
10724                                                             acquire-fence-paired-atomic
10725                                                             has completed
10726                                                             before invalidating
10727                                                             the
10728                                                             cache. Therefore
10729                                                             any following
10730                                                             locations read must
10731                                                             be no older than
10732                                                             the value read by
10733                                                             the
10734                                                             acquire-fence-paired-atomic.
10735
10736                                                         2. buffer_inv sc0=1
10737
10738                                                           - If not TgSplit execution
10739                                                             mode, omit.
10740                                                           - Ensures that
10741                                                             following
10742                                                             loads will not see
10743                                                             stale data.
10744
10745     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1
10746
10747                                                           - If OpenCL and
10748                                                             address space is
10749                                                             local, omit.
10750                                                           - Must happen before
10751                                                             following s_waitcnt.
10752                                                           - Performs L2 writeback to
10753                                                             ensure previous
10754                                                             global/generic
10755                                                             store/atomicrmw are
10756                                                             visible at agent scope.
10757
10758                                                         2. s_waitcnt lgkmcnt(0) &
10759                                                            vmcnt(0)
10760
10761                                                           - If TgSplit execution mode,
10762                                                             omit lgkmcnt(0).
10763                                                           - If OpenCL and
10764                                                             address space is
10765                                                             not generic, omit
10766                                                             lgkmcnt(0).
10767                                                           - However, since LLVM
10768                                                             currently has no
10769                                                             address space on
10770                                                             the fence, the compiler
10771                                                             must conservatively
10772                                                             always generate it
10773                                                             (see comment for
10774                                                             previous fence).
10775                                                           - Could be split into
10776                                                             separate s_waitcnt
10777                                                             vmcnt(0) and
10778                                                             s_waitcnt
10779                                                             lgkmcnt(0) to allow
10780                                                             them to be
10781                                                             independently moved
10782                                                             according to the
10783                                                             following rules.
10784                                                           - s_waitcnt vmcnt(0)
10785                                                             must happen after
10786                                                             any preceding
10787                                                             global/generic
10788                                                             load/store/load
10789                                                             atomic/store
10790                                                             atomic/atomicrmw.
10791                                                           - s_waitcnt lgkmcnt(0)
10792                                                             must happen after
10793                                                             any preceding
10794                                                             local/generic
10795                                                             load/store/load
10796                                                             atomic/store
10797                                                             atomic/atomicrmw.
10798                                                           - Must happen before
10799                                                             the following
10800                                                             buffer_inv.
10801                                                           - Ensures that the
10802                                                             preceding
10803                                                             global/local/generic
10804                                                             load
10805                                                             atomic/atomicrmw
10806                                                             with an equal or
10807                                                             wider sync scope
10808                                                             and memory ordering
10809                                                             stronger than
10810                                                             unordered (this is
10811                                                             termed the
10812                                                             acquire-fence-paired-atomic)
10813                                                             has completed
10814                                                             before invalidating
10815                                                             the cache. This
10816                                                             satisfies the
10817                                                             requirements of
10818                                                             acquire.
10819                                                           - Ensures that all
10820                                                             previous memory
10821                                                             operations have
10822                                                             completed before a
10823                                                             following
10824                                                             global/local/generic
10825                                                             store
10826                                                             atomic/atomicrmw
10827                                                             with an equal or
10828                                                             wider sync scope
10829                                                             and memory ordering
10830                                                             stronger than
10831                                                             unordered (this is
10832                                                             termed the
10833                                                             release-fence-paired-atomic).
10834                                                             This satisfies the
10835                                                             requirements of
10836                                                             release.
10837
10838                                                         3. buffer_inv sc1=1
10839
10840                                                           - Must happen before
10841                                                             any following
10842                                                             global/generic
10843                                                             load/load
10844                                                             atomic/store/store
10845                                                             atomic/atomicrmw.
10846                                                           - Ensures that
10847                                                             following loads
10848                                                             will not see stale
10849                                                             global data. This
10850                                                             satisfies the
10851                                                             requirements of
10852                                                             acquire.
10853
10854     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10855
10856                                                           - If OpenCL and
10857                                                             address space is
10858                                                             local, omit.
10859                                                           - Must happen before
10860                                                             following s_waitcnt.
10861                                                           - Performs L2 writeback to
10862                                                             ensure previous
10863                                                             global/generic
10864                                                             store/atomicrmw are
10865                                                             visible at system scope.
10866
10867                                                         1. s_waitcnt lgkmcnt(0) &
10868                                                            vmcnt(0)
10869
10870                                                           - If TgSplit execution mode,
10871                                                             omit lgkmcnt(0).
10872                                                           - If OpenCL and
10873                                                             address space is
10874                                                             not generic, omit
10875                                                             lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, lgkmcnt(0)
                                                             must conservatively
                                                             always be generated
                                                             (see comment for
                                                             previous fence).
10884                                                           - Could be split into
10885                                                             separate s_waitcnt
10886                                                             vmcnt(0) and
10887                                                             s_waitcnt
10888                                                             lgkmcnt(0) to allow
10889                                                             them to be
10890                                                             independently moved
10891                                                             according to the
10892                                                             following rules.
10893                                                           - s_waitcnt vmcnt(0)
10894                                                             must happen after
10895                                                             any preceding
10896                                                             global/generic
10897                                                             load/store/load
10898                                                             atomic/store
10899                                                             atomic/atomicrmw.
10900                                                           - s_waitcnt lgkmcnt(0)
10901                                                             must happen after
10902                                                             any preceding
10903                                                             local/generic
10904                                                             load/store/load
10905                                                             atomic/store
10906                                                             atomic/atomicrmw.
10907                                                           - Must happen before
10908                                                             the following
10909                                                             buffer_inv.
10910                                                           - Ensures that the
10911                                                             preceding
10912                                                             global/local/generic
10913                                                             load
10914                                                             atomic/atomicrmw
10915                                                             with an equal or
10916                                                             wider sync scope
10917                                                             and memory ordering
10918                                                             stronger than
10919                                                             unordered (this is
10920                                                             termed the
10921                                                             acquire-fence-paired-atomic)
10922                                                             has completed
10923                                                             before invalidating
10924                                                             the cache. This
10925                                                             satisfies the
10926                                                             requirements of
10927                                                             acquire.
10928                                                           - Ensures that all
10929                                                             previous memory
10930                                                             operations have
10931                                                             completed before a
10932                                                             following
10933                                                             global/local/generic
10934                                                             store
10935                                                             atomic/atomicrmw
10936                                                             with an equal or
10937                                                             wider sync scope
10938                                                             and memory ordering
10939                                                             stronger than
10940                                                             unordered (this is
10941                                                             termed the
10942                                                             release-fence-paired-atomic).
10943                                                             This satisfies the
10944                                                             requirements of
10945                                                             release.
10946
10947                                                         2. buffer_inv sc0=1 sc1=1
10948
10949                                                           - Must happen before
10950                                                             any following
10951                                                             global/generic
10952                                                             load/load
10953                                                             atomic/store/store
10954                                                             atomic/atomicrmw.
10955                                                           - Ensures that
10956                                                             following loads
10957                                                             will not see stale
10958                                                             MTYPE NC global data.
10959                                                             MTYPE RW and CC memory will
10960                                                             never be stale due to the
10961                                                             memory probes.
10962
10963     **Sequential Consistent Atomic**
10964     ------------------------------------------------------------------------------------
10965     load atomic  seq_cst      - singlethread - global   *Same as corresponding
10966                               - wavefront    - local    load atomic acquire,
10967                                              - generic  except must generate
10968                                                         all instructions even
10969                                                         for OpenCL.*
10970     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10971                                              - generic
10972                                                           - Use lgkmcnt(0) if not
10973                                                             TgSplit execution mode
10974                                                             and vmcnt(0) if TgSplit
10975                                                             execution mode.
10976                                                           - s_waitcnt lgkmcnt(0) must
10977                                                             happen after
10978                                                             preceding
10979                                                             local/generic load
10980                                                             atomic/store
10981                                                             atomic/atomicrmw
10982                                                             with memory
10983                                                             ordering of seq_cst
10984                                                             and with equal or
10985                                                             wider sync scope.
10986                                                             (Note that seq_cst
10987                                                             fences have their
10988                                                             own s_waitcnt
10989                                                             lgkmcnt(0) and so do
10990                                                             not need to be
10991                                                             considered.)
10992                                                           - s_waitcnt vmcnt(0)
10993                                                             must happen after
10994                                                             preceding
10995                                                             global/generic load
10996                                                             atomic/store
10997                                                             atomic/atomicrmw
10998                                                             with memory
10999                                                             ordering of seq_cst
11000                                                             and with equal or
11001                                                             wider sync scope.
11002                                                             (Note that seq_cst
11003                                                             fences have their
11004                                                             own s_waitcnt
11005                                                             vmcnt(0) and so do
11006                                                             not need to be
11007                                                             considered.)
11008                                                           - Ensures any
11009                                                             preceding
                                                             sequentially
11011                                                             consistent global/local
11012                                                             memory instructions
11013                                                             have completed
11014                                                             before executing
11015                                                             this sequentially
11016                                                             consistent
11017                                                             instruction. This
11018                                                             prevents reordering
11019                                                             a seq_cst store
11020                                                             followed by a
11021                                                             seq_cst load. (Note
11022                                                             that seq_cst is
11023                                                             stronger than
11024                                                             acquire/release as
11025                                                             the reordering of
11026                                                             load acquire
11027                                                             followed by a store
11028                                                             release is
11029                                                             prevented by the
11030                                                             s_waitcnt of
11031                                                             the release, but
11032                                                             there is nothing
11033                                                             preventing a store
11034                                                             release followed by
11035                                                             load acquire from
11036                                                             completing out of
11037                                                             order. The s_waitcnt
11038                                                             could be placed after
11039                                                             seq_store or before
11040                                                             the seq_load. We
11041                                                             choose the load to
11042                                                             make the s_waitcnt be
11043                                                             as late as possible
11044                                                             so that the store
11045                                                             may have already
11046                                                             completed.)
11047
11048                                                         2. *Following
11049                                                            instructions same as
11050                                                            corresponding load
11051                                                            atomic acquire,
11052                                                            except must generate
11053                                                            all instructions even
11054                                                            for OpenCL.*
11055     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
11056                                                         local address space cannot
11057                                                         be used.*
11058
11059                                                         *Same as corresponding
11060                                                         load atomic acquire,
11061                                                         except must generate
11062                                                         all instructions even
11063                                                         for OpenCL.*
11064
11065     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
11066                               - system       - generic     vmcnt(0)
11067
11068                                                           - If TgSplit execution mode,
11069                                                             omit lgkmcnt(0).
11070                                                           - Could be split into
11071                                                             separate s_waitcnt
11072                                                             vmcnt(0)
11073                                                             and s_waitcnt
11074                                                             lgkmcnt(0) to allow
11075                                                             them to be
11076                                                             independently moved
11077                                                             according to the
11078                                                             following rules.
11079                                                           - s_waitcnt lgkmcnt(0)
11080                                                             must happen after
11081                                                             preceding
                                                             local/generic load
11083                                                             atomic/store
11084                                                             atomic/atomicrmw
11085                                                             with memory
11086                                                             ordering of seq_cst
11087                                                             and with equal or
11088                                                             wider sync scope.
11089                                                             (Note that seq_cst
11090                                                             fences have their
11091                                                             own s_waitcnt
11092                                                             lgkmcnt(0) and so do
11093                                                             not need to be
11094                                                             considered.)
11095                                                           - s_waitcnt vmcnt(0)
11096                                                             must happen after
11097                                                             preceding
11098                                                             global/generic load
11099                                                             atomic/store
11100                                                             atomic/atomicrmw
11101                                                             with memory
11102                                                             ordering of seq_cst
11103                                                             and with equal or
11104                                                             wider sync scope.
11105                                                             (Note that seq_cst
11106                                                             fences have their
11107                                                             own s_waitcnt
11108                                                             vmcnt(0) and so do
11109                                                             not need to be
11110                                                             considered.)
11111                                                           - Ensures any
11112                                                             preceding
                                                             sequentially
11114                                                             consistent global
11115                                                             memory instructions
11116                                                             have completed
11117                                                             before executing
11118                                                             this sequentially
11119                                                             consistent
11120                                                             instruction. This
11121                                                             prevents reordering
11122                                                             a seq_cst store
11123                                                             followed by a
11124                                                             seq_cst load. (Note
11125                                                             that seq_cst is
11126                                                             stronger than
11127                                                             acquire/release as
11128                                                             the reordering of
11129                                                             load acquire
11130                                                             followed by a store
11131                                                             release is
11132                                                             prevented by the
11133                                                             s_waitcnt of
11134                                                             the release, but
11135                                                             there is nothing
11136                                                             preventing a store
11137                                                             release followed by
11138                                                             load acquire from
11139                                                             completing out of
11140                                                             order. The s_waitcnt
11141                                                             could be placed after
11142                                                             seq_store or before
11143                                                             the seq_load. We
11144                                                             choose the load to
11145                                                             make the s_waitcnt be
11146                                                             as late as possible
11147                                                             so that the store
11148                                                             may have already
11149                                                             completed.)
11150
11151                                                         2. *Following
11152                                                            instructions same as
11153                                                            corresponding load
11154                                                            atomic acquire,
11155                                                            except must generate
11156                                                            all instructions even
11157                                                            for OpenCL.*
11158     store atomic seq_cst      - singlethread - global   *Same as corresponding
11159                               - wavefront    - local    store atomic release,
11160                               - workgroup    - generic  except must generate
11161                               - agent                   all instructions even
11162                               - system                  for OpenCL.*
11163     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
11164                               - wavefront    - local    atomicrmw acq_rel,
11165                               - workgroup    - generic  except must generate
11166                               - agent                   all instructions even
11167                               - system                  for OpenCL.*
11168     fence        seq_cst      - singlethread *none*     *Same as corresponding
11169                               - wavefront               fence acq_rel,
11170                               - workgroup               except must generate
11171                               - agent                   all instructions even
11172                               - system                  for OpenCL.*
11173     ============ ============ ============== ========== ================================
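
The choice of leading ``s_waitcnt`` counters in the ``load atomic seq_cst`` rows
above, for the global and generic address spaces, can be summarized as a short
Python sketch. The helper name ``seq_cst_load_waitcnt`` and the string encoding
of scopes and counters are illustrative assumptions and are not part of LLVM.
The sketch omits the instructions shared with the corresponding ``load atomic
acquire``.

.. code-block:: python

  def seq_cst_load_waitcnt(scope, tgsplit):
      """Return the counters waited on before a seq_cst global/generic load.

      Hypothetical helper that restates the table rows; not an LLVM API.
      """
      if scope in ("singlethread", "wavefront"):
          # Same as load atomic acquire: no extra leading s_waitcnt is listed.
          return []
      if scope == "workgroup":
          # lgkmcnt(0) normally, vmcnt(0) in TgSplit execution mode.
          return ["vmcnt"] if tgsplit else ["lgkmcnt"]
      if scope in ("agent", "system"):
          # lgkmcnt(0) & vmcnt(0); lgkmcnt(0) is omitted in TgSplit mode.
          return ["vmcnt"] if tgsplit else ["lgkmcnt", "vmcnt"]
      raise ValueError(f"unknown sync scope: {scope!r}")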
11174
11175.. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11176
11177Memory Model GFX10-GFX11
11178++++++++++++++++++++++++
11179
11180For GFX10-GFX11:
11181
11182* Each agent has multiple shader arrays (SA).
11183* Each SA has multiple work-group processors (WGP).
11184* Each WGP has multiple compute units (CU).
11185* Each CU has multiple SIMDs that execute wavefronts.
11186* The wavefronts for a single work-group are executed in the same
11187  WGP. In CU wavefront execution mode the wavefronts may be executed by
11188  different SIMDs in the same CU. In WGP wavefront execution mode the
11189  wavefronts may be executed by different SIMDs in different CUs in the same
11190  WGP.
11191* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
11192  executing on it.
11193* All LDS operations of a WGP are performed as wavefront wide operations in a
11194  global order and involve no caching. Completion is reported to a wavefront in
11195  execution order.
11196* The LDS memory has multiple request queues shared by the SIMDs of a
11197  WGP. Therefore, the LDS operations performed by different wavefronts of a
11198  work-group can be reordered relative to each other, which can result in
11199  reordering the visibility of vector memory operations with respect to LDS
11200  operations of other wavefronts in the same work-group. A ``s_waitcnt
11201  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11202  vector memory operations between wavefronts of a work-group, but not between
11203  operations performed by the same wavefront.
11204* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
11206  execution order of other load/store/sample operations performed by that
11207  wavefront.
11208* The vector memory operations access a vector L0 cache. There is a single L0
11209  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11210  special action is required for coherence between the lanes of a single
11211  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11212  wavefronts executing in the same work-group as they may be executing on SIMDs
11213  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11214  required for coherence between wavefronts executing in different work-groups
11215  as they may be executing on different WGPs.
11216* The scalar memory operations access a scalar L0 cache shared by all wavefronts
11217  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11218  operations are used in a restricted way so do not impact the memory model. See
11219  :ref:`amdgpu-amdhsa-memory-spaces`.
11220* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11221  the same SA. Therefore, no special action is required for coherence between
11222  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11223  required for coherence between wavefronts executing in different work-groups
11224  as they may be executing on different SAs that access different L1s.
11225* The L1 caches have independent quadrants to service disjoint ranges of virtual
11226  addresses.
11227* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11228  vector and scalar memory operations performed by different wavefronts, whether
11229  executing in the same or different work-groups (which may be executing on
11230  different CUs accessing different L0s), can be reordered relative to each
11231  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11232  synchronization between vector memory operations of different wavefronts. It
11233  ensures a previous vector memory operation has completed before executing a
11234  subsequent vector memory or LDS operation and so can be used to meet the
11235  requirements of acquire, release and sequential consistency.
11236* The L1 caches use an L2 cache shared by all SAs on the same agent.
11237* The L2 cache has independent channels to service disjoint ranges of virtual
11238  addresses.
11239* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11240  quadrant has a separate request queue per L2 channel. Therefore, the vector
11241  and scalar memory operations performed by wavefronts executing in different
11242  work-groups (which may be executing on different SAs) of an agent can be
11243  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11244  required to ensure synchronization between vector memory operations of
11245  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
11248* The L2 cache can be kept coherent with other agents on some targets, or ranges
11249  of virtual addresses can be set up to bypass it to ensure system coherence.
11250* On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
11251  The MALL cache is fully coherent with GPU memory and has no impact on system
11252  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
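
The cache invalidations named in the list above can be restated as a small
Python sketch, keyed on how two wavefronts are related. The function name
``required_invalidates`` and its boolean parameters are hypothetical. The
sketch follows only the L0 and L1 bullets above and ignores L2, MALL, and any
refinement based on the CU or WGP wavefront execution mode.

.. code-block:: python

  def required_invalidates(same_wavefront, same_work_group):
      """Return the invalidates needed for coherence between two accesses.

      Hypothetical helper that restates the bullets above; not an LLVM API.
      """
      if same_wavefront:
          # Lanes of one wavefront share a single L0: no special action.
          return []
      if same_work_group:
          # Wavefronts may run on different CUs of the WGP (different L0s),
          # but they share the L1 of their SA.
          return ["buffer_gl0_inv"]
      # Different work-groups may run on different WGPs and different SAs.
      return ["buffer_gl0_inv", "buffer_gl1_inv"]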
11253
11254Scalar memory operations are only used to access memory that is proven to not
11255change during the execution of the kernel dispatch. This includes constant
11256address space and global address space for program scope ``const`` variables.
11257Therefore, the kernel machine code does not have to maintain the scalar cache to
11258ensure it is coherent with the vector caches. The scalar and vector caches are
11259invalidated between kernel dispatches by CP since constant address space data
11260may change between kernel dispatch executions. See
11261:ref:`amdgpu-amdhsa-memory-spaces`.
11262
11263The one exception is if scalar writes are used to spill SGPR registers. In this
11264case the AMDGPU backend ensures the memory location used to spill is never
11265accessed by vector memory operations at the same time. If scalar writes are used
11266then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11267return since the locations may be used for vector memory instructions by a
11268future wavefront that uses the same scratch area, or a function call that
11269creates a frame at the same address, respectively. There is no need for a
11270``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
11271
11272For kernarg backing memory:
11273
11274* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11275* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11276  needing to invalidate the L2 cache.
11277* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11278  so the L2 cache will be coherent with the CPU and other agents.
11279
11280Scratch backing memory (which is used for the private address space) is accessed
11281with MTYPE NC (non-coherent). Since the private address space is only accessed
11282by a single thread, and is always write-before-read, there is never a need to
11283invalidate these entries from the L0 or L1 caches.
11284
11285Wavefronts are executed in native mode with in-order reporting of loads and
11286sample instructions. In this mode vmcnt reports completion of load, atomic with
11287return and sample instructions in order, and the vscnt reports the completion of
11288store and atomic without return in order. See ``MEM_ORDERED`` field in
11289:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
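
Because completion is reported in order, a non-zero counter value can be used to
wait for older operations only. The following is a minimal sketch, assuming
GFX10 assembler mnemonics and hypothetical register assignments:

.. code-block:: nasm

  global_load_dword v0, v[4:5], off     ; first load
  global_load_dword v1, v[6:7], off     ; second load
  s_waitcnt vmcnt(1)                    ; at most one load outstanding: v0 is valid
  v_mov_b32_e32 v2, v0                  ; v0 can be read, v1 may still be pending
  s_waitcnt vmcnt(0)                    ; no loads outstanding: v1 is valid
  v_mov_b32_e32 v3, v1

A count of 0 waits for all outstanding operations tracked by that counter.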
11290
11291Wavefronts can be executed in WGP or CU wavefront execution mode:
11292
11293* In WGP wavefront execution mode the wavefronts of a work-group are executed
11294  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
11295  CU L0 caches is required for work-group synchronization. Also, accesses to L1
11296  at work-group scope need to be explicitly ordered as the accesses from
11297  different CUs are not ordered.
11298* In CU wavefront execution mode the wavefronts of a work-group are executed on
11299  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
11300  the work-group use the same L0, which in turn ensures L1 accesses are
11301  ordered, and so explicit management of the caches is not required for
11302  work-group synchronization.
11303
11304See ``WGP_MODE`` field in
11305:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
11306:ref:`amdgpu-target-features`.
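
As an illustration of the difference, the following minimal sketch shows a
workgroup-scope ``load atomic acquire`` from the global address space in each
mode, following
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`. The GFX10
assembler mnemonics and the register assignments are assumptions made for the
example:

.. code-block:: nasm

  ; WGP wavefront execution mode
  global_load_dword v0, v[2:3], off glc ; 1. load with glc=1
  s_waitcnt vmcnt(0)                    ; 2. load completes before the invalidate
  buffer_gl0_inv                        ; 3. following loads do not see stale data

  ; CU wavefront execution mode
  global_load_dword v0, v[2:3], off     ; 1. load without glc; steps 2 and 3 omitted

In CU wavefront execution mode the sequence reduces to a plain load, since all
wavefronts of the work-group access the same L0.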
11307
11308The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11309table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
11310
11311  .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11312     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11313
11314     ============ ============ ============== ========== ================================
11315     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
11316                  Ordering     Sync Scope     Address    GFX10-GFX11
11317                                              Space
11318     ============ ============ ============== ========== ================================
11319     **Non-Atomic**
11320     ------------------------------------------------------------------------------------
11321     load         *none*       *none*         - global   - !volatile & !nontemporal
11322                                              - generic
11323                                              - private    1. buffer/global/flat_load
11324                                              - constant
11325                                                         - !volatile & nontemporal
11326
11327                                                           1. buffer/global/flat_load
11328                                                              slc=1 dlc=1
11329
11330                                                            - If GFX10, omit dlc=1.
11331
11332                                                         - volatile
11333
11334                                                           1. buffer/global/flat_load
11335                                                              glc=1 dlc=1
11336
11337                                                           2. s_waitcnt vmcnt(0)
11338
11339                                                            - Must happen before
11340                                                              any following volatile
11341                                                              global/generic
11342                                                              load/store.
11343                                                            - Ensures that
11344                                                              volatile
11345                                                              operations to
11346                                                              different
11347                                                              addresses will not
11348                                                              be reordered by
11349                                                              hardware.
11350
11351     load         *none*       *none*         - local    1. ds_load
11352     store        *none*       *none*         - global   - !volatile & !nontemporal
11353                                              - generic
11354                                              - private    1. buffer/global/flat_store
11355                                              - constant
11356                                                         - !volatile & nontemporal
11357
11358                                                           1. buffer/global/flat_store
11359                                                              glc=1 slc=1 dlc=1
11360
11361                                                            - If GFX10, omit dlc=1.
11362
11363                                                         - volatile
11364
11365                                                           1. buffer/global/flat_store
11366                                                              dlc=1
11367
11368                                                            - If GFX10, omit dlc=1.
11369
11370                                                           2. s_waitcnt vscnt(0)
11371
11372                                                            - Must happen before
11373                                                              any following volatile
11374                                                              global/generic
11375                                                              load/store.
11376                                                            - Ensures that
11377                                                              volatile
11378                                                              operations to
11379                                                              different
11380                                                              addresses will not
11381                                                              be reordered by
11382                                                              hardware.
11383
11384     store        *none*       *none*         - local    1. ds_store
11385     **Unordered Atomic**
11386     ------------------------------------------------------------------------------------
11387     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
11388     store atomic unordered    *any*          *any*      *Same as non-atomic*.
11389     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
11390     **Monotonic Atomic**
11391     ------------------------------------------------------------------------------------
11392     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
11393                               - wavefront    - generic
11394     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
11395                                              - generic     glc=1
11396
11397                                                           - If CU wavefront execution
11398                                                             mode, omit glc=1.
11399
11400     load atomic  monotonic    - singlethread - local    1. ds_load
11401                               - wavefront
11402                               - workgroup
11403     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
11404                               - system       - generic     glc=1 dlc=1
11405
11406                                                           - If GFX11, omit dlc=1.
11407
11408     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
11409                               - wavefront    - generic
11410                               - workgroup
11411                               - agent
11412                               - system
11413     store atomic monotonic    - singlethread - local    1. ds_store
11414                               - wavefront
11415                               - workgroup
11416     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
11417                               - wavefront    - generic
11418                               - workgroup
11419                               - agent
11420                               - system
11421     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
11422                               - wavefront
11423                               - workgroup
11424     **Acquire Atomic**
11425     ------------------------------------------------------------------------------------
11426     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
11427                               - wavefront    - local
11428                                              - generic
11429     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
11430
11431                                                           - If CU wavefront execution
11432                                                             mode, omit glc=1.
11433
11434                                                         2. s_waitcnt vmcnt(0)
11435
11436                                                           - If CU wavefront execution
11437                                                             mode, omit.
11438                                                           - Must happen before
11439                                                             the following buffer_gl0_inv
11440                                                             and before any following
11441                                                             global/generic
11442                                                             load/load
11443                                                             atomic/store/store
11444                                                             atomic/atomicrmw.
11445
11446                                                         3. buffer_gl0_inv
11447
11448                                                           - If CU wavefront execution
11449                                                             mode, omit.
11450                                                           - Ensures that
11451                                                             following
11452                                                             loads will not see
11453                                                             stale data.
11454
11455     load atomic  acquire      - workgroup    - local    1. ds_load
11456                                                         2. s_waitcnt lgkmcnt(0)
11457
11458                                                           - If OpenCL, omit.
11459                                                           - Must happen before
11460                                                             the following buffer_gl0_inv
11461                                                             and before any following
11462                                                             global/generic load/load
11463                                                             atomic/store/store
11464                                                             atomic/atomicrmw.
11465                                                           - Ensures any
11466                                                             following global
11467                                                             data read is no
11468                                                             older than the local load
11469                                                             atomic value being
11470                                                             acquired.
11471
11472                                                         3. buffer_gl0_inv
11473
11474                                                           - If CU wavefront execution
11475                                                             mode, omit.
11476                                                           - If OpenCL, omit.
11477                                                           - Ensures that
11478                                                             following
11479                                                             loads will not see
11480                                                             stale data.
11481
11482     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
11483
11484                                                           - If CU wavefront execution
11485                                                             mode, omit glc=1.
11486
11487                                                         2. s_waitcnt lgkmcnt(0) &
11488                                                            vmcnt(0)
11489
11490                                                           - If CU wavefront execution
11491                                                             mode, omit vmcnt(0).
11492                                                           - If OpenCL, omit
11493                                                             lgkmcnt(0).
11494                                                           - Must happen before
11495                                                             the following
11496                                                             buffer_gl0_inv and any
11497                                                             following global/generic
11498                                                             load/load
11499                                                             atomic/store/store
11500                                                             atomic/atomicrmw.
11501                                                           - Ensures any
11502                                                             following global
11503                                                             data read is no
11504                                                             older than a local load
11505                                                             atomic value being
11506                                                             acquired.
11507
11508                                                         3. buffer_gl0_inv
11509
11510                                                           - If CU wavefront execution
11511                                                             mode, omit.
11512                                                           - Ensures that
11513                                                             following
11514                                                             loads will not see
11515                                                             stale data.
11516
11517     load atomic  acquire      - agent        - global   1. buffer/global_load
11518                               - system                     glc=1 dlc=1
11519
11520                                                           - If GFX11, omit dlc=1.
11521
11522                                                         2. s_waitcnt vmcnt(0)
11523
11524                                                           - Must happen before
11525                                                             following
11526                                                             buffer_gl*_inv.
11527                                                           - Ensures the load
11528                                                             has completed
11529                                                             before invalidating
11530                                                             the caches.
11531
11532                                                         3. buffer_gl0_inv;
11533                                                            buffer_gl1_inv
11534
11535                                                           - Must happen before
11536                                                             any following
11537                                                             global/generic
11538                                                             load/load
11539                                                             atomic/atomicrmw.
11540                                                           - Ensures that
11541                                                             following
11542                                                             loads will not see
11543                                                             stale global data.
11544
11545     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
11546                               - system
11547                                                           - If GFX11, omit dlc=1.
11548
11549                                                         2. s_waitcnt vmcnt(0) &
11550                                                            lgkmcnt(0)
11551
11552                                                           - If OpenCL, omit
11553                                                             lgkmcnt(0).
11554                                                           - Must happen before
11555                                                             following
11556                                                             buffer_gl*_inv.
11557                                                           - Ensures the flat_load
11558                                                             has completed
11559                                                             before invalidating
11560                                                             the caches.
11561
11562                                                         3. buffer_gl0_inv;
11563                                                            buffer_gl1_inv
11564
11565                                                           - Must happen before
11566                                                             any following
11567                                                             global/generic
11568                                                             load/load
11569                                                             atomic/atomicrmw.
11570                                                           - Ensures that
11571                                                             following loads
11572                                                             will not see stale
11573                                                             global data.
11574
11575     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
11576                               - wavefront    - local
11577                                              - generic
11578     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
11579                                                         2. s_waitcnt vm/vscnt(0)
11580
11581                                                           - If CU wavefront execution
11582                                                             mode, omit.
11583                                                           - Use vmcnt(0) if atomic with
11584                                                             return and vscnt(0) if
11585                                                             atomic with no-return.
11586                                                           - Must happen before
11587                                                             the following buffer_gl0_inv
11588                                                             and before any following
11589                                                             global/generic
11590                                                             load/load
11591                                                             atomic/store/store
11592                                                             atomic/atomicrmw.
11593
11594                                                         3. buffer_gl0_inv
11595
11596                                                           - If CU wavefront execution
11597                                                             mode, omit.
11598                                                           - Ensures that
11599                                                             following
11600                                                             loads will not see
11601                                                             stale data.
11602
11603     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
11604                                                         2. s_waitcnt lgkmcnt(0)
11605
11606                                                           - If OpenCL, omit.
11607                                                           - Must happen before
11608                                                             the following
11609                                                             buffer_gl0_inv.
11610                                                           - Ensures any
11611                                                             following global
11612                                                             data read is no
11613                                                             older than the local
11614                                                             atomicrmw value
11615                                                             being acquired.
11616
11617                                                         3. buffer_gl0_inv
11618
11619                                                           - If OpenCL, omit.
11620                                                           - Ensures that
11621                                                             following
11622                                                             loads will not see
11623                                                             stale data.
11624
11625     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
11626                                                         2. s_waitcnt lgkmcnt(0) &
11627                                                            vm/vscnt(0)
11628
11629                                                           - If CU wavefront execution
11630                                                             mode, omit vm/vscnt(0).
11631                                                           - If OpenCL, omit lgkmcnt(0).
11632                                                           - Use vmcnt(0) if atomic with
11633                                                             return and vscnt(0) if
11634                                                             atomic with no-return.
11635                                                           - Must happen before
11636                                                             the following
11637                                                             buffer_gl0_inv.
11638                                                           - Ensures any
11639                                                             following global
11640                                                             data read is no
11641                                                             older than a local
11642                                                             atomicrmw value
11643                                                             being acquired.
11644
11645                                                         3. buffer_gl0_inv
11646
11647                                                           - If CU wavefront execution
11648                                                             mode, omit.
11649                                                           - Ensures that
11650                                                             following
11651                                                             loads will not see
11652                                                             stale data.
11653
11654     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
11655                               - system                  2. s_waitcnt vm/vscnt(0)
11656
11657                                                           - Use vmcnt(0) if atomic with
11658                                                             return and vscnt(0) if
11659                                                             atomic with no-return.
11660                                                           - Must happen before
11661                                                             following
11662                                                             buffer_gl*_inv.
11663                                                           - Ensures the
11664                                                             atomicrmw has
11665                                                             completed before
11666                                                             invalidating the
11667                                                             caches.
11668
11669                                                         3. buffer_gl0_inv;
11670                                                            buffer_gl1_inv
11671
11672                                                           - Must happen before
11673                                                             any following
11674                                                             global/generic
11675                                                             load/load
11676                                                             atomic/atomicrmw.
11677                                                           - Ensures that
11678                                                             following loads
11679                                                             will not see stale
11680                                                             global data.
11681
11682     atomicrmw    acquire      - agent        - generic  1. flat_atomic
11683                               - system                  2. s_waitcnt vm/vscnt(0) &
11684                                                            lgkmcnt(0)
11685
11686                                                           - If OpenCL, omit
11687                                                             lgkmcnt(0).
11688                                                           - Use vmcnt(0) if atomic with
11689                                                             return and vscnt(0) if
11690                                                             atomic with no-return.
11691                                                           - Must happen before
11692                                                             following
11693                                                             buffer_gl*_inv.
11694                                                           - Ensures the
11695                                                             atomicrmw has
11696                                                             completed before
11697                                                             invalidating the
11698                                                             caches.
11699
11700                                                         3. buffer_gl0_inv;
11701                                                            buffer_gl1_inv
11702
11703                                                           - Must happen before
11704                                                             any following
11705                                                             global/generic
11706                                                             load/load
11707                                                             atomic/atomicrmw.
11708                                                           - Ensures that
11709                                                             following loads
11710                                                             will not see stale
11711                                                             global data.
11712
11713     fence        acquire      - singlethread *none*     *none*
11714                               - wavefront
11715     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
11716                                                            vmcnt(0) & vscnt(0)
11717
11718                                                           - If CU wavefront execution
11719                                                             mode, omit vmcnt(0) and
11720                                                             vscnt(0).
11721                                                           - If OpenCL and
11722                                                             address space is
11723                                                             not generic, omit
11724                                                             lgkmcnt(0).
11725                                                           - If OpenCL and
11726                                                             address space is
11727                                                             local, omit
11728                                                             vmcnt(0) and vscnt(0).
11729                                                           - However, since LLVM
11730                                                             currently has no
11731                                                             address space on
11732                                                             the fence, the wait
11733                                                             must conservatively
11734                                                             always be generated.
11735                                                             If the fence had an
11736                                                             address space, it
11737                                                             would be set to the
11738                                                             address space of the
11739                                                             OpenCL fence flag, or
11740                                                             to generic if both
11741                                                             local and global
11742                                                             flags are
11743                                                             specified.
11744                                                           - Could be split into
11745                                                             separate s_waitcnt
11746                                                             vmcnt(0), s_waitcnt
11747                                                             vscnt(0) and s_waitcnt
11748                                                             lgkmcnt(0) to allow
11749                                                             them to be
11750                                                             independently moved
11751                                                             according to the
11752                                                             following rules.
11753                                                           - s_waitcnt vmcnt(0)
11754                                                             must happen after
11755                                                             any preceding
11756                                                             global/generic load
11757                                                             atomic/
11758                                                             atomicrmw-with-return-value
11759                                                             with an equal or
11760                                                             wider sync scope
11761                                                             and memory ordering
11762                                                             stronger than
11763                                                             unordered (this is
11764                                                             termed the
11765                                                             fence-paired-atomic).
11766                                                           - s_waitcnt vscnt(0)
11767                                                             must happen after
11768                                                             any preceding
11769                                                             global/generic
11770                                                             atomicrmw-no-return-value
11771                                                             with an equal or
11772                                                             wider sync scope
11773                                                             and memory ordering
11774                                                             stronger than
11775                                                             unordered (this is
11776                                                             termed the
11777                                                             fence-paired-atomic).
11778                                                           - s_waitcnt lgkmcnt(0)
11779                                                             must happen after
11780                                                             any preceding
11781                                                             local/generic load
11782                                                             atomic/atomicrmw
11783                                                             with an equal or
11784                                                             wider sync scope
11785                                                             and memory ordering
11786                                                             stronger than
11787                                                             unordered (this is
11788                                                             termed the
11789                                                             fence-paired-atomic).
11790                                                           - Must happen before
11791                                                             the following
11792                                                             buffer_gl0_inv.
11793                                                           - Ensures that the
11794                                                             fence-paired-atomic
11795                                                             has completed
11796                                                             before invalidating
11797                                                             the
11798                                                             cache. Therefore
11799                                                             any following
11800                                                             locations read must
11801                                                             be no older than
11802                                                             the value read by
11803                                                             the
11804                                                             fence-paired-atomic.
11805
11806                                                         2. buffer_gl0_inv
11807
11808                                                           - If CU wavefront execution
11809                                                             mode, omit.
11810                                                           - Ensures that
11811                                                             following
11812                                                             loads will not see
11813                                                             stale data.
11814
11815     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
11816                               - system                     vmcnt(0) & vscnt(0)
11817
11818                                                           - If OpenCL and
11819                                                             address space is
11820                                                             not generic, omit
11821                                                             lgkmcnt(0).
11822                                                           - If OpenCL and
11823                                                             address space is
11824                                                             local, omit
11825                                                             vmcnt(0) and vscnt(0).
11826                                                           - However, since LLVM
11827                                                             currently has no
11828                                                             address space on
11829                                                             the fence, the full
11830                                                             s_waitcnt must always
11831                                                             be generated
11832                                                             conservatively (see the
11833                                                             previous fence).
11834                                                           - Could be split into
11835                                                             separate s_waitcnt
11836                                                             vmcnt(0), s_waitcnt
11837                                                             vscnt(0) and s_waitcnt
11838                                                             lgkmcnt(0) to allow
11839                                                             them to be
11840                                                             independently moved
11841                                                             according to the
11842                                                             following rules.
11843                                                           - s_waitcnt vmcnt(0)
11844                                                             must happen after
11845                                                             any preceding
11846                                                             global/generic load
11847                                                             atomic/
11848                                                             atomicrmw-with-return-value
11849                                                             with an equal or
11850                                                             wider sync scope
11851                                                             and memory ordering
11852                                                             stronger than
11853                                                             unordered (this is
11854                                                             termed the
11855                                                             fence-paired-atomic).
11856                                                           - s_waitcnt vscnt(0)
11857                                                             must happen after
11858                                                             any preceding
11859                                                             global/generic
11860                                                             atomicrmw-no-return-value
11861                                                             with an equal or
11862                                                             wider sync scope
11863                                                             and memory ordering
11864                                                             stronger than
11865                                                             unordered (this is
11866                                                             termed the
11867                                                             fence-paired-atomic).
11868                                                           - s_waitcnt lgkmcnt(0)
11869                                                             must happen after
11870                                                             any preceding
11871                                                             local/generic load
11872                                                             atomic/atomicrmw
11873                                                             with an equal or
11874                                                             wider sync scope
11875                                                             and memory ordering
11876                                                             stronger than
11877                                                             unordered (this is
11878                                                             termed the
11879                                                             fence-paired-atomic).
11880                                                           - Must happen before
11881                                                             the following
11882                                                             buffer_gl*_inv.
11883                                                           - Ensures that the
11884                                                             fence-paired-atomic
11885                                                             has completed
11886                                                             before invalidating
11887                                                             the
11888                                                             caches. Therefore
11889                                                             any following
11890                                                             locations read must
11891                                                             be no older than
11892                                                             the value read by
11893                                                             the
11894                                                             fence-paired-atomic.
11895
11896                                                         2. buffer_gl0_inv;
11897                                                            buffer_gl1_inv
11898
11899                                                           - Must happen before any
11900                                                             following global/generic
11901                                                             load/load
11902                                                             atomic/store/store
11903                                                             atomic/atomicrmw.
11904                                                           - Ensures that
11905                                                             following loads
11906                                                             will not see stale
11907                                                             global data.
11908
11909     **Release Atomic**
11910     ------------------------------------------------------------------------------------
11911     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
11912                               - wavefront    - local
11913                                              - generic
11914     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
11915                                              - generic     vmcnt(0) & vscnt(0)
11916
11917                                                           - If CU wavefront execution
11918                                                             mode, omit vmcnt(0) and
11919                                                             vscnt(0).
11920                                                           - If OpenCL, omit
11921                                                             lgkmcnt(0).
11922                                                           - Could be split into
11923                                                             separate s_waitcnt
11924                                                             vmcnt(0), s_waitcnt
11925                                                             vscnt(0) and s_waitcnt
11926                                                             lgkmcnt(0) to allow
11927                                                             them to be
11928                                                             independently moved
11929                                                             according to the
11930                                                             following rules.
11931                                                           - s_waitcnt vmcnt(0)
11932                                                             must happen after
11933                                                             any preceding
11934                                                             global/generic load/load
11935                                                             atomic/
11936                                                             atomicrmw-with-return-value.
11937                                                           - s_waitcnt vscnt(0)
11938                                                             must happen after
11939                                                             any preceding
11940                                                             global/generic
11941                                                             store/store
11942                                                             atomic/
11943                                                             atomicrmw-no-return-value.
11944                                                           - s_waitcnt lgkmcnt(0)
11945                                                             must happen after
11946                                                             any preceding
11947                                                             local/generic
11948                                                             load/store/load
11949                                                             atomic/store
11950                                                             atomic/atomicrmw.
11951                                                           - Must happen before
11952                                                             the following
11953                                                             store.
11954                                                           - Ensures that all
11955                                                             memory operations
11956                                                             have
11957                                                             completed before
11958                                                             performing the
11959                                                             store that is being
11960                                                             released.
11961
11962                                                         2. buffer/global/flat_store
11963     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
11964
11965                                                           - If CU wavefront execution
11966                                                             mode, omit.
11967                                                           - If OpenCL, omit.
11968                                                           - Could be split into
11969                                                             separate s_waitcnt
11970                                                             vmcnt(0) and s_waitcnt
11971                                                             vscnt(0) to allow
11972                                                             them to be
11973                                                             independently moved
11974                                                             according to the
11975                                                             following rules.
11976                                                           - s_waitcnt vmcnt(0)
11977                                                             must happen after
11978                                                             any preceding
11979                                                             global/generic load/load
11980                                                             atomic/
11981                                                             atomicrmw-with-return-value.
11982                                                           - s_waitcnt vscnt(0)
11983                                                             must happen after
11984                                                             any preceding
11985                                                             global/generic
11986                                                             store/store atomic/
11987                                                             atomicrmw-no-return-value.
11988                                                           - Must happen before
11989                                                             the following
11990                                                             store.
11991                                                           - Ensures that all
11992                                                             global memory
11993                                                             operations have
11994                                                             completed before
11995                                                             performing the
11996                                                             store that is being
11997                                                             released.
11998
11999                                                         2. ds_store
12000     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12001                               - system       - generic     vmcnt(0) & vscnt(0)
12002
12003                                                           - If OpenCL and
12004                                                             address space is
12005                                                             not generic, omit
12006                                                             lgkmcnt(0).
12007                                                           - Could be split into
12008                                                             separate s_waitcnt
12009                                                             vmcnt(0), s_waitcnt vscnt(0)
12010                                                             and s_waitcnt
12011                                                             lgkmcnt(0) to allow
12012                                                             them to be
12013                                                             independently moved
12014                                                             according to the
12015                                                             following rules.
12016                                                           - s_waitcnt vmcnt(0)
12017                                                             must happen after
12018                                                             any preceding
12019                                                             global/generic
12020                                                             load/load
12021                                                             atomic/
12022                                                             atomicrmw-with-return-value.
12023                                                           - s_waitcnt vscnt(0)
12024                                                             must happen after
12025                                                             any preceding
12026                                                             global/generic
12027                                                             store/store atomic/
12028                                                             atomicrmw-no-return-value.
12029                                                           - s_waitcnt lgkmcnt(0)
12030                                                             must happen after
12031                                                             any preceding
12032                                                             local/generic
12033                                                             load/store/load
12034                                                             atomic/store
12035                                                             atomic/atomicrmw.
12036                                                           - Must happen before
12037                                                             the following
12038                                                             store.
12039                                                           - Ensures that all
12040                                                             memory operations
12041                                                             have
12042                                                             completed before
12043                                                             performing the
12044                                                             store that is being
12045                                                             released.
12046
12047                                                         2. buffer/global/flat_store
12048     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
12049                               - wavefront    - local
12050                                              - generic
12051     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12052                                              - generic     vmcnt(0) & vscnt(0)
12053
12054                                                           - If CU wavefront execution
12055                                                             mode, omit vmcnt(0) and
12056                                                             vscnt(0).
12057                                                           - If OpenCL, omit lgkmcnt(0).
12058                                                           - Could be split into
12059                                                             separate s_waitcnt
12060                                                             vmcnt(0), s_waitcnt
12061                                                             vscnt(0) and s_waitcnt
12062                                                             lgkmcnt(0) to allow
12063                                                             them to be
12064                                                             independently moved
12065                                                             according to the
12066                                                             following rules.
12067                                                           - s_waitcnt vmcnt(0)
12068                                                             must happen after
12069                                                             any preceding
12070                                                             global/generic load/load
12071                                                             atomic/
12072                                                             atomicrmw-with-return-value.
12073                                                           - s_waitcnt vscnt(0)
12074                                                             must happen after
12075                                                             any preceding
12076                                                             global/generic
12077                                                             store/store
12078                                                             atomic/
12079                                                             atomicrmw-no-return-value.
12080                                                           - s_waitcnt lgkmcnt(0)
12081                                                             must happen after
12082                                                             any preceding
12083                                                             local/generic
12084                                                             load/store/load
12085                                                             atomic/store
12086                                                             atomic/atomicrmw.
12087                                                           - Must happen before
12088                                                             the following
12089                                                             atomicrmw.
12090                                                           - Ensures that all
12091                                                             memory operations
12092                                                             have
12093                                                             completed before
12094                                                             performing the
12095                                                             atomicrmw that is
12096                                                             being released.
12097
12098                                                         2. buffer/global/flat_atomic
12099     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12100
12101                                                           - If CU wavefront execution
12102                                                             mode, omit.
12103                                                           - If OpenCL, omit.
12104                                                           - Could be split into
12105                                                             separate s_waitcnt
12106                                                             vmcnt(0) and s_waitcnt
12107                                                             vscnt(0) to allow
12108                                                             them to be
12109                                                             independently moved
12110                                                             according to the
12111                                                             following rules.
12112                                                           - s_waitcnt vmcnt(0)
12113                                                             must happen after
12114                                                             any preceding
12115                                                             global/generic load/load
12116                                                             atomic/
12117                                                             atomicrmw-with-return-value.
12118                                                           - s_waitcnt vscnt(0)
12119                                                             must happen after
12120                                                             any preceding
12121                                                             global/generic
12122                                                             store/store atomic/
12123                                                             atomicrmw-no-return-value.
12124                                                           - Must happen before
12125                                                             the following
12126                                                             atomicrmw.
12127                                                           - Ensures that all
12128                                                             global memory
12129                                                             operations have
12130                                                             completed before
12131                                                             performing the
12132                                                             atomicrmw that is
12133                                                             being released.
12134
12135                                                         2. ds_atomic
12136     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12137                               - system       - generic      vmcnt(0) & vscnt(0)
12138
12139                                                           - If OpenCL, omit
12140                                                             lgkmcnt(0).
12141                                                           - Could be split into
12142                                                             separate s_waitcnt
12143                                                             vmcnt(0), s_waitcnt
12144                                                             vscnt(0) and s_waitcnt
12145                                                             lgkmcnt(0) to allow
12146                                                             them to be
12147                                                             independently moved
12148                                                             according to the
12149                                                             following rules.
12150                                                           - s_waitcnt vmcnt(0)
12151                                                             must happen after
12152                                                             any preceding
12153                                                             global/generic
12154                                                             load/load atomic/
12155                                                             atomicrmw-with-return-value.
12156                                                           - s_waitcnt vscnt(0)
12157                                                             must happen after
12158                                                             any preceding
12159                                                             global/generic
12160                                                             store/store atomic/
12161                                                             atomicrmw-no-return-value.
12162                                                           - s_waitcnt lgkmcnt(0)
12163                                                             must happen after
12164                                                             any preceding
12165                                                             local/generic
12166                                                             load/store/load
12167                                                             atomic/store
12168                                                             atomic/atomicrmw.
12169                                                           - Must happen before
12170                                                             the following
12171                                                             atomicrmw.
12172                                                           - Ensures that all
12173                                                             memory operations
12174                                                             to global and local
12175                                                             have completed
12176                                                             before performing
12177                                                             the atomicrmw that
12178                                                             is being released.
12179
12180                                                         2. buffer/global/flat_atomic
12181     fence        release      - singlethread *none*     *none*
12182                               - wavefront
12183     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12184                                                            vmcnt(0) & vscnt(0)
12185
12186                                                           - If CU wavefront execution
12187                                                             mode, omit vmcnt(0) and
12188                                                             vscnt(0).
12189                                                           - If OpenCL and
12190                                                             address space is
12191                                                             not generic, omit
12192                                                             lgkmcnt(0).
12193                                                           - If OpenCL and
12194                                                             address space is
12195                                                             local, omit
12196                                                             vmcnt(0) and vscnt(0).
12197                                                           - However, since LLVM
12198                                                             currently has no
12199                                                             address space on
12200                                                             the fence, the waits
12201                                                             must conservatively
12202                                                             always be generated.
12203                                                             If the fence had an
12204                                                             address space, it
12205                                                             would be set to the
12206                                                             address space of the
12207                                                             OpenCL fence flag, or
12208                                                             to generic if both
12209                                                             local and global
12210                                                             flags are
12211                                                             specified.
12212                                                           - Could be split into
12213                                                             separate s_waitcnt
12214                                                             vmcnt(0), s_waitcnt
12215                                                             vscnt(0) and s_waitcnt
12216                                                             lgkmcnt(0) to allow
12217                                                             them to be
12218                                                             independently moved
12219                                                             according to the
12220                                                             following rules.
12221                                                           - s_waitcnt vmcnt(0)
12222                                                             must happen after
12223                                                             any preceding
12224                                                             global/generic
12225                                                             load/load
12226                                                             atomic/
12227                                                             atomicrmw-with-return-value.
12228                                                           - s_waitcnt vscnt(0)
12229                                                             must happen after
12230                                                             any preceding
12231                                                             global/generic
12232                                                             store/store atomic/
12233                                                             atomicrmw-no-return-value.
12234                                                           - s_waitcnt lgkmcnt(0)
12235                                                             must happen after
12236                                                             any preceding
12237                                                             local/generic
12238                                                             load/store/load
12239                                                             atomic/store atomic/
12240                                                             atomicrmw.
12241                                                           - Must happen before
12242                                                             any following store
12243                                                             atomic/atomicrmw
12244                                                             with an equal or
12245                                                             wider sync scope
12246                                                             and memory ordering
12247                                                             stronger than
12248                                                             unordered (this is
12249                                                             termed the
12250                                                             fence-paired-atomic).
12251                                                           - Ensures that all
12252                                                             memory operations
12253                                                             have
12254                                                             completed before
12255                                                             performing the
12256                                                             following
12257                                                             fence-paired-atomic.
12258
12259     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12260                               - system                     vmcnt(0) & vscnt(0)
12261
12262                                                           - If OpenCL and
12263                                                             address space is
12264                                                             not generic, omit
12265                                                             lgkmcnt(0).
12266                                                           - If OpenCL and
12267                                                             address space is
12268                                                             local, omit
12269                                                             vmcnt(0) and vscnt(0).
12270                                                           - However, since LLVM
12271                                                             currently has no
12272                                                             address space on
12273                                                             the fence, the waits
12274                                                             must conservatively
12275                                                             always be generated.
12276                                                             If the fence had an
12277                                                             address space, it
12278                                                             would be set to the
12279                                                             address space of the
12280                                                             OpenCL fence flag, or
12281                                                             to generic if both
12282                                                             local and global
12283                                                             flags are
12284                                                             specified.
12285                                                           - Could be split into
12286                                                             separate s_waitcnt
12287                                                             vmcnt(0), s_waitcnt
12288                                                             vscnt(0) and s_waitcnt
12289                                                             lgkmcnt(0) to allow
12290                                                             them to be
12291                                                             independently moved
12292                                                             according to the
12293                                                             following rules.
12294                                                           - s_waitcnt vmcnt(0)
12295                                                             must happen after
12296                                                             any preceding
12297                                                             global/generic
12298                                                             load/load atomic/
12299                                                             atomicrmw-with-return-value.
12300                                                           - s_waitcnt vscnt(0)
12301                                                             must happen after
12302                                                             any preceding
12303                                                             global/generic
12304                                                             store/store atomic/
12305                                                             atomicrmw-no-return-value.
12306                                                           - s_waitcnt lgkmcnt(0)
12307                                                             must happen after
12308                                                             any preceding
12309                                                             local/generic
12310                                                             load/store/load
12311                                                             atomic/store
12312                                                             atomic/atomicrmw.
12313                                                           - Must happen before
12314                                                             any following store
12315                                                             atomic/atomicrmw
12316                                                             with an equal or
12317                                                             wider sync scope
12318                                                             and memory ordering
12319                                                             stronger than
12320                                                             unordered (this is
12321                                                             termed the
12322                                                             fence-paired-atomic).
12323                                                           - Ensures that all
12324                                                             memory operations
12325                                                             have
12326                                                             completed before
12327                                                             performing the
12328                                                             following
12329                                                             fence-paired-atomic.
12330
12331     **Acquire-Release Atomic**
12332     ------------------------------------------------------------------------------------
12333     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
12334                               - wavefront    - local
12335                                              - generic
12336     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12337                                                            vmcnt(0) & vscnt(0)
12338
12339                                                           - If CU wavefront execution
12340                                                             mode, omit vmcnt(0) and
12341                                                             vscnt(0).
12342                                                           - If OpenCL, omit
12343                                                             lgkmcnt(0).
12344                                                           - Must happen after
12345                                                             any preceding
12346                                                             local/generic
12347                                                             load/store/load
12348                                                             atomic/store
12349                                                             atomic/atomicrmw.
12350                                                           - Could be split into
12351                                                             separate s_waitcnt
12352                                                             vmcnt(0), s_waitcnt
12353                                                             vscnt(0), and s_waitcnt
12354                                                             lgkmcnt(0) to allow
12355                                                             them to be
12356                                                             independently moved
12357                                                             according to the
12358                                                             following rules.
12359                                                           - s_waitcnt vmcnt(0)
12360                                                             must happen after
12361                                                             any preceding
12362                                                             global/generic load/load
12363                                                             atomic/
12364                                                             atomicrmw-with-return-value.
12365                                                           - s_waitcnt vscnt(0)
12366                                                             must happen after
12367                                                             any preceding
12368                                                             global/generic
12369                                                             store/store
12370                                                             atomic/
12371                                                             atomicrmw-no-return-value.
12372                                                           - s_waitcnt lgkmcnt(0)
12373                                                             must happen after
12374                                                             any preceding
12375                                                             local/generic
12376                                                             load/store/load
12377                                                             atomic/store
12378                                                             atomic/atomicrmw.
12379                                                           - Must happen before
12380                                                             the following
12381                                                             atomicrmw.
12382                                                           - Ensures that all
12383                                                             memory operations
12384                                                             have
12385                                                             completed before
12386                                                             performing the
12387                                                             atomicrmw that is
12388                                                             being released.
12389
12390                                                         2. buffer/global_atomic
12391                                                         3. s_waitcnt vm/vscnt(0)
12392
12393                                                           - If CU wavefront execution
12394                                                             mode, omit.
12395                                                           - Use vmcnt(0) if atomic with
12396                                                             return and vscnt(0) if
12397                                                             atomic with no-return.
12398                                                           - Must happen before
12399                                                             the following
12400                                                             buffer_gl0_inv.
12401                                                           - Ensures any
12402                                                             following global
12403                                                             data read is no
12404                                                             older than the
12405                                                             atomicrmw value
12406                                                             being acquired.
12407
12408                                                         4. buffer_gl0_inv
12409
12410                                                           - If CU wavefront execution
12411                                                             mode, omit.
12412                                                           - Ensures that
12413                                                             following
12414                                                             loads will not see
12415                                                             stale data.
12416
12417     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12418
12419                                                           - If CU wavefront execution
12420                                                             mode, omit.
12421                                                           - If OpenCL, omit.
12422                                                           - Could be split into
12423                                                             separate s_waitcnt
12424                                                             vmcnt(0) and s_waitcnt
12425                                                             vscnt(0) to allow
12426                                                             them to be
12427                                                             independently moved
12428                                                             according to the
12429                                                             following rules.
12430                                                           - s_waitcnt vmcnt(0)
12431                                                             must happen after
12432                                                             any preceding
12433                                                             global/generic load/load
12434                                                             atomic/
12435                                                             atomicrmw-with-return-value.
12436                                                           - s_waitcnt vscnt(0)
12437                                                             must happen after
12438                                                             any preceding
12439                                                             global/generic
12440                                                             store/store atomic/
12441                                                             atomicrmw-no-return-value.
12442                                                           - Must happen before
12443                                                             the following
12444                                                             atomicrmw.
12445                                                           - Ensures that all
12446                                                             global memory
12447                                                             operations have
12448                                                             completed before
12449                                                             performing the
12450                                                             atomicrmw that is
12451                                                             being released.
12452
12453                                                         2. ds_atomic
12454                                                         3. s_waitcnt lgkmcnt(0)
12455
12456                                                           - If OpenCL, omit.
12457                                                           - Must happen before
12458                                                             the following
12459                                                             buffer_gl0_inv.
12460                                                           - Ensures any
12461                                                             following global
12462                                                             data read is no
12463                                                             older than the local
12464                                                             atomicrmw value being
12465                                                             acquired.
12466
12467                                                         4. buffer_gl0_inv
12468
12469                                                           - If CU wavefront execution
12470                                                             mode, omit.
12471                                                           - If OpenCL, omit.
12472                                                           - Ensures that
12473                                                             following
12474                                                             loads will not see
12475                                                             stale data.
12476
12477     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
12478                                                            vmcnt(0) & vscnt(0)
12479
12480                                                           - If CU wavefront execution
12481                                                             mode, omit vmcnt(0) and
12482                                                             vscnt(0).
12483                                                           - If OpenCL, omit lgkmcnt(0).
12484                                                           - Could be split into
12485                                                             separate s_waitcnt
12486                                                             vmcnt(0), s_waitcnt
12487                                                             vscnt(0) and s_waitcnt
12488                                                             lgkmcnt(0) to allow
12489                                                             them to be
12490                                                             independently moved
12491                                                             according to the
12492                                                             following rules.
12493                                                           - s_waitcnt vmcnt(0)
12494                                                             must happen after
12495                                                             any preceding
12496                                                             global/generic load/load
12497                                                             atomic/
12498                                                             atomicrmw-with-return-value.
12499                                                           - s_waitcnt vscnt(0)
12500                                                             must happen after
12501                                                             any preceding
12502                                                             global/generic
12503                                                             store/store
12504                                                             atomic/
12505                                                             atomicrmw-no-return-value.
12506                                                           - s_waitcnt lgkmcnt(0)
12507                                                             must happen after
12508                                                             any preceding
12509                                                             local/generic
12510                                                             load/store/load
12511                                                             atomic/store
12512                                                             atomic/atomicrmw.
12513                                                           - Must happen before
12514                                                             the following
12515                                                             atomicrmw.
12516                                                           - Ensures that all
12517                                                             memory operations
12518                                                             have
12519                                                             completed before
12520                                                             performing the
12521                                                             atomicrmw that is
12522                                                             being released.
12523
12524                                                         2. flat_atomic
12525                                                         3. s_waitcnt lgkmcnt(0) &
12526                                                            vmcnt(0) & vscnt(0)
12527
12528                                                           - If CU wavefront execution
12529                                                             mode, omit vmcnt(0) and
12530                                                             vscnt(0).
12531                                                           - If OpenCL, omit lgkmcnt(0).
12532                                                           - Must happen before
12533                                                             the following
12534                                                             buffer_gl0_inv.
12535                                                           - Ensures any
12536                                                             following global
12537                                                             data read is no
12538                                                             older than the
12539                                                             atomicrmw value being
12540                                                             acquired.
12541
12542                                                         4. buffer_gl0_inv
12543
12544                                                           - If CU wavefront execution
12545                                                             mode, omit.
12546                                                           - Ensures that
12547                                                             following
12548                                                             loads will not see
12549                                                             stale data.
12550
12551     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12552                               - system                     vmcnt(0) & vscnt(0)
12553
12554                                                           - If OpenCL, omit
12555                                                             lgkmcnt(0).
12556                                                           - Could be split into
12557                                                             separate s_waitcnt
12558                                                             vmcnt(0), s_waitcnt
12559                                                             vscnt(0) and s_waitcnt
12560                                                             lgkmcnt(0) to allow
12561                                                             them to be
12562                                                             independently moved
12563                                                             according to the
12564                                                             following rules.
12565                                                           - s_waitcnt vmcnt(0)
12566                                                             must happen after
12567                                                             any preceding
12568                                                             global/generic
12569                                                             load/load atomic/
12570                                                             atomicrmw-with-return-value.
12571                                                           - s_waitcnt vscnt(0)
12572                                                             must happen after
12573                                                             any preceding
12574                                                             global/generic
12575                                                             store/store atomic/
12576                                                             atomicrmw-no-return-value.
12577                                                           - s_waitcnt lgkmcnt(0)
12578                                                             must happen after
12579                                                             any preceding
12580                                                             local/generic
12581                                                             load/store/load
12582                                                             atomic/store
12583                                                             atomic/atomicrmw.
12584                                                           - Must happen before
12585                                                             the following
12586                                                             atomicrmw.
12587                                                           - Ensures that all
12588                                                             memory operations
12589                                                             to global have
12590                                                             completed before
12591                                                             performing the
12592                                                             atomicrmw that is
12593                                                             being released.
12594
12595                                                         2. buffer/global_atomic
12596                                                         3. s_waitcnt vm/vscnt(0)
12597
12598                                                           - Use vmcnt(0) if atomic with
12599                                                             return and vscnt(0) if
12600                                                             atomic with no-return.
12601                                                           - Must happen before
12602                                                             following
12603                                                             buffer_gl*_inv.
12604                                                           - Ensures the
12605                                                             atomicrmw has
12606                                                             completed before
12607                                                             invalidating the
12608                                                             caches.
12609
12610                                                         4. buffer_gl0_inv;
12611                                                            buffer_gl1_inv
12612
12613                                                           - Must happen before
12614                                                             any following
12615                                                             global/generic
12616                                                             load/load
12617                                                             atomic/atomicrmw.
12618                                                           - Ensures that
12619                                                             following loads
12620                                                             will not see stale
12621                                                             global data.
12622
12623     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
12624                               - system                     vmcnt(0) & vscnt(0)
12625
12626                                                           - If OpenCL, omit
12627                                                             lgkmcnt(0).
12628                                                           - Could be split into
12629                                                             separate s_waitcnt
12630                                                             vmcnt(0), s_waitcnt
12631                                                             vscnt(0), and s_waitcnt
12632                                                             lgkmcnt(0) to allow
12633                                                             them to be
12634                                                             independently moved
12635                                                             according to the
12636                                                             following rules.
12637                                                           - s_waitcnt vmcnt(0)
12638                                                             must happen after
12639                                                             any preceding
12640                                                             global/generic
12641                                                             load/load atomic/
12642                                                             atomicrmw-with-return-value.
12643                                                           - s_waitcnt vscnt(0)
12644                                                             must happen after
12645                                                             any preceding
12646                                                             global/generic
12647                                                             store/store atomic/
12648                                                             atomicrmw-no-return-value.
12649                                                           - s_waitcnt lgkmcnt(0)
12650                                                             must happen after
12651                                                             any preceding
12652                                                             local/generic
12653                                                             load/store/load
12654                                                             atomic/store
12655                                                             atomic/atomicrmw.
12656                                                           - Must happen before
12657                                                             the following
12658                                                             atomicrmw.
12659                                                           - Ensures that all
12660                                                             memory operations
12661                                                             have
12662                                                             completed before
12663                                                             performing the
12664                                                             atomicrmw that is
12665                                                             being released.
12666
12667                                                         2. flat_atomic
12668                                                         3. s_waitcnt vm/vscnt(0) &
12669                                                            lgkmcnt(0)
12670
12671                                                           - If OpenCL, omit
12672                                                             lgkmcnt(0).
12673                                                           - Use vmcnt(0) if atomic with
12674                                                             return and vscnt(0) if
12675                                                             atomic with no-return.
12676                                                           - Must happen before
12677                                                             following
12678                                                             buffer_gl*_inv.
12679                                                           - Ensures the
12680                                                             atomicrmw has
12681                                                             completed before
12682                                                             invalidating the
12683                                                             caches.
12684
12685                                                         4. buffer_gl0_inv;
12686                                                            buffer_gl1_inv
12687
12688                                                           - Must happen before
12689                                                             any following
12690                                                             global/generic
12691                                                             load/load
12692                                                             atomic/atomicrmw.
12693                                                           - Ensures that
12694                                                             following loads
12695                                                             will not see stale
12696                                                             global data.
12697
12698     fence        acq_rel      - singlethread *none*     *none*
12699                               - wavefront
12700     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12701                                                            vmcnt(0) & vscnt(0)
12702
12703                                                           - If CU wavefront execution
12704                                                             mode, omit vmcnt(0) and
12705                                                             vscnt(0).
12706                                                           - If OpenCL and
12707                                                             address space is
12708                                                             not generic, omit
12709                                                             lgkmcnt(0).
12710                                                           - If OpenCL and
12711                                                             address space is
12712                                                             local, omit
12713                                                             vmcnt(0) and vscnt(0).
12714                                                           - However,
12715                                                             since LLVM
12716                                                             currently has no
12717                                                             address space on
12718                                                             the fence, the waits
12719                                                             must conservatively
12720                                                             always be generated
12721                                                             (see comment for
12722                                                             previous fence).
12723                                                           - Could be split into
12724                                                             separate s_waitcnt
12725                                                             vmcnt(0), s_waitcnt
12726                                                             vscnt(0) and s_waitcnt
12727                                                             lgkmcnt(0) to allow
12728                                                             them to be
12729                                                             independently moved
12730                                                             according to the
12731                                                             following rules.
12732                                                           - s_waitcnt vmcnt(0)
12733                                                             must happen after
12734                                                             any preceding
12735                                                             global/generic
12736                                                             load/load
12737                                                             atomic/
12738                                                             atomicrmw-with-return-value.
12739                                                           - s_waitcnt vscnt(0)
12740                                                             must happen after
12741                                                             any preceding
12742                                                             global/generic
12743                                                             store/store atomic/
12744                                                             atomicrmw-no-return-value.
12745                                                           - s_waitcnt lgkmcnt(0)
12746                                                             must happen after
12747                                                             any preceding
12748                                                             local/generic
12749                                                             load/store/load
12750                                                             atomic/store atomic/
12751                                                             atomicrmw.
12752                                                           - Must happen before
12753                                                             any following
12754                                                             global/generic
12755                                                             load/load
12756                                                             atomic/store/store
12757                                                             atomic/atomicrmw.
12758                                                           - Ensures that all
12759                                                             memory operations
12760                                                             have
12761                                                             completed before
12762                                                             performing any
12763                                                             following global
12764                                                             memory operations.
12765                                                           - Ensures that the
12766                                                             preceding
12767                                                             local/generic load
12768                                                             atomic/atomicrmw
12769                                                             with an equal or
12770                                                             wider sync scope
12771                                                             and memory ordering
12772                                                             stronger than
12773                                                             unordered (this is
12774                                                             termed the
12775                                                             acquire-fence-paired-atomic)
12776                                                             has completed
12777                                                             before following
12778                                                             global memory
12779                                                             operations. This
12780                                                             satisfies the
12781                                                             requirements of
12782                                                             acquire.
12783                                                           - Ensures that all
12784                                                             previous memory
12785                                                             operations have
12786                                                             completed before a
12787                                                             following
12788                                                             local/generic store
12789                                                             atomic/atomicrmw
12790                                                             with an equal or
12791                                                             wider sync scope
12792                                                             and memory ordering
12793                                                             stronger than
12794                                                             unordered (this is
12795                                                             termed the
12796                                                             release-fence-paired-atomic).
12797                                                             This satisfies the
12798                                                             requirements of
12799                                                             release.
12800                                                           - Must happen before
12801                                                             the following
12802                                                             buffer_gl0_inv.
12803                                                           - Ensures that the
12804                                                             acquire-fence-paired-atomic
12805                                                             has completed
12806                                                             before invalidating
12807                                                             the
12808                                                             cache. Therefore
12809                                                             any following
12810                                                             locations read must
12811                                                             be no older than
12812                                                             the value read by
12813                                                             the
12814                                                             acquire-fence-paired-atomic.
12815
12816                                                         2. buffer_gl0_inv
12817
12818                                                           - If CU wavefront execution
12819                                                             mode, omit.
12820                                                           - Ensures that
12821                                                             following
12822                                                             loads will not see
12823                                                             stale data.
12824
12825     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12826                               - system                     vmcnt(0) & vscnt(0)
12827
12828                                                           - If OpenCL and
12829                                                             address space is
12830                                                             not generic, omit
12831                                                             lgkmcnt(0).
12832                                                           - If OpenCL and
12833                                                             address space is
12834                                                             local, omit
12835                                                             vmcnt(0) and vscnt(0).
12836                                                           - However, since LLVM
12837                                                             currently has no
12838                                                             address space on
12839                                                             the fence, the waits
12840                                                             must conservatively
12841                                                             always be generated
12842                                                             (see comment for
12843                                                             previous fence).
12844                                                           - Could be split into
12845                                                             separate s_waitcnt
12846                                                             vmcnt(0), s_waitcnt
12847                                                             vscnt(0) and s_waitcnt
12848                                                             lgkmcnt(0) to allow
12849                                                             them to be
12850                                                             independently moved
12851                                                             according to the
12852                                                             following rules.
12853                                                           - s_waitcnt vmcnt(0)
12854                                                             must happen after
12855                                                             any preceding
12856                                                             global/generic
12857                                                             load/load
12858                                                             atomic/
12859                                                             atomicrmw-with-return-value.
12860                                                           - s_waitcnt vscnt(0)
12861                                                             must happen after
12862                                                             any preceding
12863                                                             global/generic
12864                                                             store/store atomic/
12865                                                             atomicrmw-no-return-value.
12866                                                           - s_waitcnt lgkmcnt(0)
12867                                                             must happen after
12868                                                             any preceding
12869                                                             local/generic
12870                                                             load/store/load
12871                                                             atomic/store
12872                                                             atomic/atomicrmw.
12873                                                           - Must happen before
12874                                                             the following
12875                                                             buffer_gl*_inv.
12876                                                           - Ensures that the
12877                                                             preceding
12878                                                             global/local/generic
12879                                                             load
12880                                                             atomic/atomicrmw
12881                                                             with an equal or
12882                                                             wider sync scope
12883                                                             and memory ordering
12884                                                             stronger than
12885                                                             unordered (this is
12886                                                             termed the
12887                                                             acquire-fence-paired-atomic)
12888                                                             has completed
12889                                                             before invalidating
12890                                                             the caches. This
12891                                                             satisfies the
12892                                                             requirements of
12893                                                             acquire.
12894                                                           - Ensures that all
12895                                                             previous memory
12896                                                             operations have
12897                                                             completed before a
12898                                                             following
12899                                                             global/local/generic
12900                                                             store
12901                                                             atomic/atomicrmw
12902                                                             with an equal or
12903                                                             wider sync scope
12904                                                             and memory ordering
12905                                                             stronger than
12906                                                             unordered (this is
12907                                                             termed the
12908                                                             release-fence-paired-atomic).
12909                                                             This satisfies the
12910                                                             requirements of
12911                                                             release.
12912
12913                                                         2. buffer_gl0_inv;
12914                                                            buffer_gl1_inv
12915
12916                                                           - Must happen before
12917                                                             any following
12918                                                             global/generic
12919                                                             load/load
12920                                                             atomic/store/store
12921                                                             atomic/atomicrmw.
12922                                                           - Ensures that
12923                                                             following loads
12924                                                             will not see stale
12925                                                             global data. This
12926                                                             satisfies the
12927                                                             requirements of
12928                                                             acquire.
12929
12930     **Sequentially Consistent Atomic**
12931     ------------------------------------------------------------------------------------
12932     load atomic  seq_cst      - singlethread - global   *Same as corresponding
12933                               - wavefront    - local    load atomic acquire,
12934                                              - generic  except must generate
12935                                                         all instructions even
12936                                                         for OpenCL.*
12937     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12938                                              - generic     vmcnt(0) & vscnt(0)
12939
12940                                                           - If CU wavefront execution
12941                                                             mode, omit vmcnt(0) and
12942                                                             vscnt(0).
12943                                                           - Could be split into
12944                                                             separate s_waitcnt
12945                                                             vmcnt(0), s_waitcnt
12946                                                             vscnt(0), and s_waitcnt
12947                                                             lgkmcnt(0) to allow
12948                                                             them to be
12949                                                             independently moved
12950                                                             according to the
12951                                                             following rules.
12952                                                           - s_waitcnt lgkmcnt(0) must
12953                                                             happen after
12954                                                             preceding
12955                                                             local/generic load
12956                                                             atomic/store
12957                                                             atomic/atomicrmw
12958                                                             with memory
12959                                                             ordering of seq_cst
12960                                                             and with equal or
12961                                                             wider sync scope.
12962                                                             (Note that seq_cst
12963                                                             fences have their
12964                                                             own s_waitcnt
12965                                                             lgkmcnt(0) and so do
12966                                                             not need to be
12967                                                             considered.)
12968                                                           - s_waitcnt vmcnt(0)
12969                                                             must happen after
12970                                                             preceding
12971                                                             global/generic load
12972                                                             atomic/
12973                                                             atomicrmw-with-return-value
12974                                                             with memory
12975                                                             ordering of seq_cst
12976                                                             and with equal or
12977                                                             wider sync scope.
12978                                                             (Note that seq_cst
12979                                                             fences have their
12980                                                             own s_waitcnt
12981                                                             vmcnt(0) and so do
12982                                                             not need to be
12983                                                             considered.)
12984                                                           - s_waitcnt vscnt(0)
12985                                                             must happen after
12986                                                             preceding
12987                                                             global/generic store
12988                                                             atomic/
12989                                                             atomicrmw-no-return-value
12990                                                             with memory
12991                                                             ordering of seq_cst
12992                                                             and with equal or
12993                                                             wider sync scope.
12994                                                             (Note that seq_cst
12995                                                             fences have their
12996                                                             own s_waitcnt
12997                                                             vscnt(0) and so do
12998                                                             not need to be
12999                                                             considered.)
13000                                                           - Ensures any
13001                                                             preceding
13002                                                             sequentially
13003                                                             consistent global/local
13004                                                             memory instructions
13005                                                             have completed
13006                                                             before executing
13007                                                             this sequentially
13008                                                             consistent
13009                                                             instruction. This
13010                                                             prevents reordering
13011                                                             a seq_cst store
13012                                                             followed by a
13013                                                             seq_cst load. (Note
13014                                                             that seq_cst is
13015                                                             stronger than
13016                                                             acquire/release as
13017                                                             the reordering of
13018                                                             load acquire
13019                                                             followed by a store
13020                                                             release is
13021                                                             prevented by the
13022                                                             s_waitcnt of
13023                                                             the release, but
13024                                                             there is nothing
13025                                                             preventing a store
13026                                                             release followed by
13027                                                             load acquire from
13028                                                             completing out of
13029                                                             order. The s_waitcnt
13030                                                             could be placed after
13031                                                             the seq_cst store or
13032                                                             before the seq_cst load. We
13033                                                             choose the load to
13034                                                             make the s_waitcnt be
13035                                                             as late as possible
13036                                                             so that the store
13037                                                             may have already
13038                                                             completed.)
13039
13040                                                         2. *Following
13041                                                            instructions same as
13042                                                            corresponding load
13043                                                            atomic acquire,
13044                                                            except must generate
13045                                                            all instructions even
13046                                                            for OpenCL.*
13047     load atomic  seq_cst      - workgroup    - local
13048
13049                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
13050
13051                                                           - If CU wavefront execution
13052                                                             mode, omit.
13053                                                           - Could be split into
13054                                                             separate s_waitcnt
13055                                                             vmcnt(0) and s_waitcnt
13056                                                             vscnt(0) to allow
13057                                                             them to be
13058                                                             independently moved
13059                                                             according to the
13060                                                             following rules.
13061                                                           - s_waitcnt vmcnt(0)
13062                                                             must happen after
13063                                                             preceding
13064                                                             global/generic load
13065                                                             atomic/
13066                                                             atomicrmw-with-return-value
13067                                                             with memory
13068                                                             ordering of seq_cst
13069                                                             and with equal or
13070                                                             wider sync scope.
13071                                                             (Note that seq_cst
13072                                                             fences have their
13073                                                             own s_waitcnt
13074                                                             vmcnt(0) and so do
13075                                                             not need to be
13076                                                             considered.)
13077                                                           - s_waitcnt vscnt(0)
13078                                                             must happen after
13079                                                             preceding
13080                                                             global/generic store
13081                                                             atomic/
13082                                                             atomicrmw-no-return-value
13083                                                             with memory
13084                                                             ordering of seq_cst
13085                                                             and with equal or
13086                                                             wider sync scope.
13087                                                             (Note that seq_cst
13088                                                             fences have their
13089                                                             own s_waitcnt
13090                                                             vscnt(0) and so do
13091                                                             not need to be
13092                                                             considered.)
13093                                                           - Ensures any
13094                                                             preceding
13095                                                             sequentially
13096                                                             consistent global
13097                                                             memory instructions
13098                                                             have completed
13099                                                             before executing
13100                                                             this sequentially
13101                                                             consistent
13102                                                             instruction. This
13103                                                             prevents reordering
13104                                                             a seq_cst store
13105                                                             followed by a
13106                                                             seq_cst load. (Note
13107                                                             that seq_cst is
13108                                                             stronger than
13109                                                             acquire/release as
13110                                                             the reordering of
13111                                                             load acquire
13112                                                             followed by a store
13113                                                             release is
13114                                                             prevented by the
13115                                                             s_waitcnt of
13116                                                             the release, but
13117                                                             there is nothing
13118                                                             preventing a store
13119                                                             release followed by
13120                                                             load acquire from
13121                                                             completing out of
13122                                                             order. The s_waitcnt
13123                                                             could be placed after
13124                                                             the seq_cst store or
13125                                                             before the seq_cst load. We
13126                                                             choose the load to
13127                                                             make the s_waitcnt be
13128                                                             as late as possible
13129                                                             so that the store
13130                                                             may have already
13131                                                             completed.)
13132
13133                                                         2. *Following
13134                                                            instructions same as
13135                                                            corresponding load
13136                                                            atomic acquire,
13137                                                            except must generate
13138                                                            all instructions even
13139                                                            for OpenCL.*
13140
13141     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13142                               - system       - generic     vmcnt(0) & vscnt(0)
13143
13144                                                           - Could be split into
13145                                                             separate s_waitcnt
13146                                                             vmcnt(0), s_waitcnt
13147                                                             vscnt(0) and s_waitcnt
13148                                                             lgkmcnt(0) to allow
13149                                                             them to be
13150                                                             independently moved
13151                                                             according to the
13152                                                             following rules.
13153                                                           - s_waitcnt lgkmcnt(0)
13154                                                             must happen after
13155                                                             preceding
13156                                                             local load
13157                                                             atomic/store
13158                                                             atomic/atomicrmw
13159                                                             with memory
13160                                                             ordering of seq_cst
13161                                                             and with equal or
13162                                                             wider sync scope.
13163                                                             (Note that seq_cst
13164                                                             fences have their
13165                                                             own s_waitcnt
13166                                                             lgkmcnt(0) and so do
13167                                                             not need to be
13168                                                             considered.)
13169                                                           - s_waitcnt vmcnt(0)
13170                                                             must happen after
13171                                                             preceding
13172                                                             global/generic load
13173                                                             atomic/
13174                                                             atomicrmw-with-return-value
13175                                                             with memory
13176                                                             ordering of seq_cst
13177                                                             and with equal or
13178                                                             wider sync scope.
13179                                                             (Note that seq_cst
13180                                                             fences have their
13181                                                             own s_waitcnt
13182                                                             vmcnt(0) and so do
13183                                                             not need to be
13184                                                             considered.)
13185                                                           - s_waitcnt vscnt(0)
13186                                                             must happen after
13187                                                             preceding
13188                                                             global/generic store
13189                                                             atomic/
13190                                                             atomicrmw-no-return-value
13191                                                             with memory
13192                                                             ordering of seq_cst
13193                                                             and with equal or
13194                                                             wider sync scope.
13195                                                             (Note that seq_cst
13196                                                             fences have their
13197                                                             own s_waitcnt
13198                                                             vscnt(0) and so do
13199                                                             not need to be
13200                                                             considered.)
13201                                                           - Ensures any
13202                                                             preceding
13203                                                             sequentially
13204                                                             consistent global
13205                                                             memory instructions
13206                                                             have completed
13207                                                             before executing
13208                                                             this sequentially
13209                                                             consistent
13210                                                             instruction. This
13211                                                             prevents reordering
13212                                                             a seq_cst store
13213                                                             followed by a
13214                                                             seq_cst load. (Note
13215                                                             that seq_cst is
13216                                                             stronger than
13217                                                             acquire/release as
13218                                                             the reordering of
13219                                                             load acquire
13220                                                             followed by a store
13221                                                             release is
13222                                                             prevented by the
13223                                                             s_waitcnt of
13224                                                             the release, but
13225                                                             there is nothing
13226                                                             preventing a store
13227                                                             release followed by
13228                                                             load acquire from
13229                                                             completing out of
13230                                                             order. The s_waitcnt
13231                                                             could be placed after
13232                                                             the seq_cst store or
13233                                                             before the seq_cst load. We
13234                                                             choose the load to
13235                                                             make the s_waitcnt be
13236                                                             as late as possible
13237                                                             so that the store
13238                                                             may have already
13239                                                             completed.)
13240
13241                                                         2. *Following
13242                                                            instructions same as
13243                                                            corresponding load
13244                                                            atomic acquire,
13245                                                            except must generate
13246                                                            all instructions even
13247                                                            for OpenCL.*
13248     store atomic seq_cst      - singlethread - global   *Same as corresponding
13249                               - wavefront    - local    store atomic release,
13250                               - workgroup    - generic  except must generate
13251                               - agent                   all instructions even
13252                               - system                  for OpenCL.*
13253     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
13254                               - wavefront    - local    atomicrmw acq_rel,
13255                               - workgroup    - generic  except must generate
13256                               - agent                   all instructions even
13257                               - system                  for OpenCL.*
13258     fence        seq_cst      - singlethread *none*     *Same as corresponding
13259                               - wavefront               fence acq_rel,
13260                               - workgroup               except must generate
13261                               - agent                   all instructions even
13262                               - system                  for OpenCL.*
13263     ============ ============ ============== ========== ================================
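
The seq_cst ``load atomic`` rows above can be read as a small decision
procedure for the leading ``s_waitcnt``. The following Python sketch is
illustrative only: the function name, the argument spellings, and the choice to
reject combinations that the rows do not list are assumptions made for this
example and are not part of LLVM. It also assumes that the singlethread and
wavefront rows add no leading wait beyond what the corresponding acquire rows
generate.

.. code-block:: python

  # Sketch only: encodes the seq_cst "load atomic" rows of the table above.
  # The function name and string spellings are hypothetical, not LLVM APIs.

  ADDRESS_SPACES = ("global", "local", "generic")

  def seq_cst_load_pre_waits(scope, address_space, cu_mode=False):
      """Return the counters to wait on (to zero) before the load.

      The result is a tuple in a fixed order. cu_mode=True models CU
      wavefront execution mode.
      """
      if address_space not in ADDRESS_SPACES:
          raise ValueError("unknown address space")
      if scope in ("singlethread", "wavefront"):
          # Assumed: same as load atomic acquire, so no leading s_waitcnt.
          return ()
      if scope == "workgroup" and address_space in ("global", "generic"):
          # CU wavefront execution mode omits vmcnt(0) and vscnt(0).
          if cu_mode:
              return ("lgkmcnt",)
          return ("lgkmcnt", "vmcnt", "vscnt")
      if scope == "workgroup" and address_space == "local":
          # CU wavefront execution mode omits the whole s_waitcnt.
          return () if cu_mode else ("vmcnt", "vscnt")
      if scope in ("agent", "system") and address_space != "local":
          # No CU mode exception is listed for these scopes.
          return ("lgkmcnt", "vmcnt", "vscnt")
      raise ValueError("combination not listed in the rows above")

For example, ``seq_cst_load_pre_waits("workgroup", "generic", cu_mode=True)``
returns ``("lgkmcnt",)``, which corresponds to the "omit vmcnt(0) and
vscnt(0)" note in the workgroup row.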
13264
13265.. _amdgpu-amdhsa-trap-handler-abi:
13266
13267Trap Handler ABI
13268~~~~~~~~~~~~~~~~
13269
13270For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13271runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13272supports the ``s_trap`` instruction. For usage see:
13273
13274- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13275- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13276- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13277
13278  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13279     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13280
13281     =================== =============== =============== =======================================
13282     Usage               Code Sequence   Trap Handler    Description
13283                                         Inputs
13284     =================== =============== =============== =======================================
13285     reserved            ``s_trap 0x00``                 Reserved by hardware.
13286     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
13287                                           ``queue_ptr`` intrinsic (not implemented).
13288                                         ``VGPR0``:
13289                                           ``arg``
13290     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13291                                           ``queue_ptr`` the trap instruction. The associated
13292                                                         queue is signalled to put it into the
13293                                                         error state.  When the queue is put in
13294                                                         the error state, the waves executing
13295                                                         dispatches on the queue will be
13296                                                         terminated.
13297     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
13298                                                           as a no-operation. The trap handler
13299                                                           is entered and immediately returns to
13300                                                           continue execution of the wavefront.
13301                                                         - If the debugger is enabled, causes
13302                                                           the debug trap to be reported by the
13303                                                           debugger and the wavefront is put in
13304                                                           the halt state with the PC at the
13305                                                           instruction.  The debugger must
13306                                                           increment the PC and resume the wave.
13307     reserved            ``s_trap 0x04``                 Reserved.
13308     reserved            ``s_trap 0x05``                 Reserved.
13309     reserved            ``s_trap 0x06``                 Reserved.
13310     reserved            ``s_trap 0x07``                 Reserved.
13311     reserved            ``s_trap 0x08``                 Reserved.
13312     reserved            ``s_trap 0xfe``                 Reserved.
13313     reserved            ``s_trap 0xff``                 Reserved.
13314     =================== =============== =============== =======================================
13315
13316..
13317
13318  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13319     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13320
13321     =================== =============== =============== =======================================
13322     Usage               Code Sequence   Trap Handler    Description
13323                                         Inputs
13324     =================== =============== =============== =======================================
13325     reserved            ``s_trap 0x00``                 Reserved by hardware.
13326     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
13327                                                         breakpoints. Causes wave to be halted
13328                                                         with the PC at the trap instruction.
13329                                                         The debugger is responsible for resuming
13330                                                         the wave, including the instruction
13331                                                         that the breakpoint overwrote.
13332     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13333                                           ``queue_ptr`` the trap instruction. The associated
13334                                                         queue is signalled to put it into the
13335                                                         error state.  When the queue is put in
13336                                                         the error state, the waves executing
13337                                                         dispatches on the queue will be
13338                                                         terminated.
13339     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
13340                                                           as a no-operation. The trap handler
13341                                                           is entered and immediately returns to
13342                                                           continue execution of the wavefront.
13343                                                         - If the debugger is enabled, causes
13344                                                           the debug trap to be reported by the
13345                                                           debugger and the wavefront is put in
13346                                                           the halt state with the PC at the
13347                                                           instruction.  The debugger must
13348                                                           increment the PC and resume the wave.
13349     reserved            ``s_trap 0x04``                 Reserved.
13350     reserved            ``s_trap 0x05``                 Reserved.
13351     reserved            ``s_trap 0x06``                 Reserved.
13352     reserved            ``s_trap 0x07``                 Reserved.
13353     reserved            ``s_trap 0x08``                 Reserved.
13354     reserved            ``s_trap 0xfe``                 Reserved.
13355     reserved            ``s_trap 0xff``                 Reserved.
13356     =================== =============== =============== =======================================
13357
13358..
13359
13360  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13361     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13362
13363     =================== =============== ================ ================= =======================================
13364     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13365     =================== =============== ================ ================= =======================================
13366     reserved            ``s_trap 0x00``                                    Reserved by hardware.
13367     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
13368                                                                            breakpoints. Causes wave to be halted
13369                                                                            with the PC at the trap instruction.
13370                                                                            The debugger is responsible to resume
13371                                                                            the wave, including the instruction
13372                                                                            that the breakpoint overwrote.
13373     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
13374                                           ``queue_ptr``                    the trap instruction. The associated
13375                                                                            queue is signalled to put it into the
13376                                                                            error state.  When the queue is put in
13377                                                                            the error state, the waves executing
13378                                                                            dispatches on the queue will be
13379                                                                            terminated.
13380     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
13381                                                                              as a no-operation. The trap handler
13382                                                                              is entered and immediately returns to
13383                                                                              continue execution of the wavefront.
13384                                                                            - If the debugger is enabled, causes
13385                                                                              the debug trap to be reported by the
13386                                                                              debugger and the wavefront is put in
13387                                                                              the halt state with the PC at the
13388                                                                              instruction.  The debugger must
13389                                                                              increment the PC and resume the wave.
13390     reserved            ``s_trap 0x04``                                    Reserved.
13391     reserved            ``s_trap 0x05``                                    Reserved.
13392     reserved            ``s_trap 0x06``                                    Reserved.
13393     reserved            ``s_trap 0x07``                                    Reserved.
13394     reserved            ``s_trap 0x08``                                    Reserved.
13395     reserved            ``s_trap 0xfe``                                    Reserved.
13396     reserved            ``s_trap 0xff``                                    Reserved.
13397     =================== =============== ================ ================= =======================================
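
The V4-and-above table can be read as a lookup from trap ID to usage and
required inputs. The following Python sketch is only a model of that table: the
names ``TRAP_IDS``, ``trap_usage`` and ``trap_inputs`` are hypothetical and are
not part of any LLVM or runtime API.

```python
# Hypothetical model of the trap ID table for code object V4 and above.
# Values are transcribed from the table; all names are illustrative only.
TRAP_IDS = {
    0x00: "reserved by hardware",
    0x01: "debugger breakpoint",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
}
RESERVED_IDS = {0x04, 0x05, 0x06, 0x07, 0x08, 0xFE, 0xFF}


def trap_usage(trap_id):
    """Return the usage listed for an s_trap ID, or None if it is not listed."""
    if trap_id in RESERVED_IDS:
        return "reserved"
    return TRAP_IDS.get(trap_id)


def trap_inputs(trap_id, gfx_major):
    """Return the register inputs the table lists for a trap ID.

    Only llvm.trap on GFX6-GFX8 takes an input: queue_ptr in SGPR0-1.
    """
    if not 6 <= gfx_major <= 11:
        raise ValueError("the table covers GFX6-GFX11 only")
    if trap_id == 0x02 and gfx_major <= 8:
        return {"SGPR0-1": "queue_ptr"}
    return {}
```

For example, ``trap_inputs(0x02, 8)`` reports the ``queue_ptr`` input, while
the same trap ID on GFX9 reports no inputs.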
13398
13399.. _amdgpu-amdhsa-function-call-convention:
13400
13401Call Convention
13402~~~~~~~~~~~~~~~
13403
13404.. note::
13405
  This section is currently incomplete and has inaccuracies. It is a work in
  progress that will be updated as information is determined.
13408
13409See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13410addresses. Unswizzled addresses are normal linear addresses.
13411
13412.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
13413
13414Kernel Functions
13415++++++++++++++++
13416
13417This section describes the call convention ABI for the outer kernel function.
13418
13419See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
13420convention.
13421
The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU backend implements kernel functions:
13424
134251.  Clang decides the kernarg layout to match the *HSA Programmer's Language
13426    Reference* [HSA]_.
13427
13428    - All structs are passed directly.
13429    - Lambda values are passed *TBA*.
13430
13431    .. TODO::
13432
13433      - Does this really follow HSA rules? Or are structs >16 bytes passed
13434        by-value struct?
13435      - What is ABI for lambda values?
13436
2.  The kernel performs certain setup in its prolog, as described in
    :ref:`amdgpu-amdhsa-kernel-prolog`.
13439
13440.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13441
13442Non-Kernel Functions
13443++++++++++++++++++++
13444
13445This section describes the call convention ABI for functions other than the
13446outer kernel function.
13447
If a kernel has function calls, scratch is always allocated and used for the
call stack, which grows from low addresses to high addresses in the swizzled
scratch address space.
13451
13452On entry to a function:
13453
134541.  SGPR0-3 contain a V# with the following properties (see
13455    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
13456
13457    * Base address pointing to the beginning of the wavefront scratch backing
13458      memory.
13459    * Swizzled with dword element size and stride of wavefront size elements.
13460
2.  The FLAT_SCRATCH register pair is set up. See
    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
134633.  GFX6-GFX8: M0 register set to the size of LDS in bytes. See
13464    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
134654.  The EXEC register is set to the lanes active on entry to the function.
134665.  MODE register: *TBD*
134676.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
13468    below.
7.  SGPR30-31 hold the return address (RA): the code address that the function
    must return to when it completes. The value is undefined if the function is
    *no return*.
134728.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13473    offset relative to the beginning of the wavefront scratch backing memory.
13474
13475    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
13476    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
13477    manner.
13478
13479    The unswizzled SP value can be converted into the swizzled SP value by:
13480
13481      | swizzled SP = unswizzled SP / wavefront size
13482
13483    This may be used to obtain the private address space address of stack
13484    objects and to convert this address to a flat address by adding the flat
13485    scratch aperture base address.
13486
    The swizzled SP value is always 4-byte aligned for the ``r600``
    architecture and 16-byte aligned for the ``amdgcn`` architecture.
13489
13490    .. note::
13491
13492      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13493      OpenCL language which has the largest base type defined as 16 bytes.
13494
13495    On entry, the swizzled SP value is the address of the first function
13496    argument passed on the stack. Other stack passed arguments are positive
13497    offsets from the entry swizzled SP value.
13498
13499    The function may use positive offsets beyond the last stack passed argument
13500    for stack allocated local variables and register spill slots. If necessary,
13501    the function may align these to greater alignment than 16 bytes. After these
13502    the function may dynamically allocate space for such things as runtime sized
13503    ``alloca`` local allocations.
13504
13505    If the function calls another function, it will place any stack allocated
13506    arguments after the last local allocation and adjust SGPR32 to the address
13507    after the last local allocation.
13508
135099.  All other registers are unspecified.
1351010. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
13511    to the function.
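
The SP arithmetic in item 8 can be sketched as follows. This is a minimal
illustration under stated assumptions, not backend code: the function names are
hypothetical, and addresses are modeled as plain integers in bytes.

```python
# Sketch of the SP conversions described in item 8 above.
# All names are illustrative; none come from the AMDGPU backend.

def to_swizzled_sp(unswizzled, wavefront_size):
    """swizzled SP = unswizzled SP / wavefront size."""
    if unswizzled % wavefront_size != 0:
        raise ValueError("unswizzled SP is not a multiple of the wavefront size")
    return unswizzled // wavefront_size


def to_unswizzled_sp(swizzled, wavefront_size):
    """Inverse conversion: scale the swizzled offset by the wavefront size."""
    return swizzled * wavefront_size


def flat_address(swizzled, flat_scratch_aperture_base):
    """Convert a private (swizzled) address to flat by adding the aperture base."""
    return flat_scratch_aperture_base + swizzled


def is_sp_aligned(swizzled, arch):
    """Check the documented alignment: 4 bytes for r600, 16 bytes for amdgcn."""
    alignment = {"r600": 4, "amdgcn": 16}[arch]
    return swizzled % alignment == 0
```

With a wavefront size of 64, an unswizzled SP of 1024 corresponds to a swizzled
SP of 16, which satisfies the ``amdgcn`` alignment requirement.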
13512
13513On exit from a function:
13514
135151.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13516    described below. Any registers used are considered clobbered registers.
135172.  The following registers are preserved and have the same value as on entry:
13518
13519    * FLAT_SCRATCH
13520    * EXEC
13521    * GFX6-GFX8: M0
13522    * All SGPR registers except the clobbered registers of SGPR4-31.
13523    * VGPR40-47
13524    * VGPR56-63
13525    * VGPR72-79
13526    * VGPR88-95
13527    * VGPR104-111
13528    * VGPR120-127
13529    * VGPR136-143
13530    * VGPR152-159
13531    * VGPR168-175
13532    * VGPR184-191
13533    * VGPR200-207
13534    * VGPR216-223
13535    * VGPR232-239
13536    * VGPR248-255
13537
13538        .. note::
13539
          Except for the argument registers, the clobbered and preserved VGPRs
          are intermixed at regular intervals in order to keep a similar ratio
          independent of the number of allocated VGPRs.
13543
13544    * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13545    * Lanes of all VGPRs that are inactive at the call site.
13546
      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
      optimization may mark some of the clobbered SGPR and VGPR registers as
      preserved if it can be determined that the called function does not change
      their value.
13551
3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
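
The preserved VGPR ranges listed above follow a regular pattern: blocks of 8
registers starting at VGPR40 and repeating every 16 registers. The sketch below
regenerates that list. The pattern is inferred from the list itself and the
function names are hypothetical.

```python
# Regenerate the preserved VGPR ranges from the pattern visible in the list:
# 8 preserved registers starting at VGPR40, repeating every 16 registers.
# This is an observation about the list, not a statement of backend code.

def preserved_vgpr_ranges():
    """Return the (first, last) VGPR ranges listed as preserved."""
    return [(first, first + 7) for first in range(40, 256, 16)]


def is_preserved_vgpr(n):
    """True if VGPRn falls in one of the listed preserved ranges."""
    return 40 <= n <= 255 and (n - 40) % 16 < 8
```

Under this pattern VGPR56 is preserved while VGPR48 is clobbered. The IPRA
optimization described above may preserve additional registers, which this
sketch does not model.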
13557
13558.. TODO::
13559
13560  - How are function results returned? The address of structured types is passed
13561    by reference, but what about other types?
13562
13563The function input arguments are made up of the formal arguments explicitly
13564declared by the source language function plus the implicit input arguments used
13565by the implementation.
13566
13567The source language input arguments are:
13568
135691. Any source language implicit ``this`` or ``self`` argument comes first as a
13570   pointer type.
135712. Followed by the function formal arguments in left to right source order.
13572
13573The source language result arguments are:
13574
135751. The function result argument.
13576
Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if
the called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
13584
Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location,
copying the struct value into it, and passing the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.
13590
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
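
The size rule in the three paragraphs above can be summarized by the following
sketch. It only restates the text, which is itself flagged as possibly
inaccurate, and the function name ``classify_struct`` is hypothetical.

```python
# Restates the 16-byte rule from the text above. The text is marked as
# possibly inaccurate, so treat this as a summary, not as the ABI.
DIRECT_STRUCT_LIMIT = 16  # bytes


def classify_struct(size_bytes, is_result=False):
    """Return the Clang term the text uses for a struct of the given size."""
    if size_bytes <= DIRECT_STRUCT_LIMIT:
        # Decomposed recursively into base type fields.
        return "direct struct"
    if is_result:
        # Caller allocates the result location and passes its address.
        return "sret"
    # Caller makes a stack copy and passes its address.
    return "by-value struct"
```

A 16-byte struct is therefore passed as a *direct struct*, while a 20-byte
struct is passed as a *by-value struct* or returned through *sret*.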
13597
*TODO: correct the sret definition.*
13599
13600.. TODO::
13601
13602  Is this definition correct? Or is ``sret`` only used if passing in registers, and
13603  pass as non-decomposed struct as stack argument? Or something else? Is the
13604  memory location in the caller stack frame, or a stack memory argument and so
13605  no address is passed as the caller can directly write to the argument stack
13606  location? But then the stack location is still live after return. If an
13607  argument stack location is it the first stack argument or the last one?
13608
13609Lambda argument types are treated as struct types with an implementation defined
13610set of fields.
13611
13612.. TODO::
13613
13614  Need to specify the ABI for lambda types for AMDGPU.
13615
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
13619
13620The AMDGPU backend walks the function call graph from the leaves to determine
13621which implicit input arguments are used, propagating to each caller of the
13622function. The used implicit arguments are appended to the function arguments
13623after the source language arguments in the following order:
13624
13625.. TODO::
13626
  Are recursion and external functions supported?
13628
136291.  Work-Item ID (1 VGPR)
13630
    The X, Y and Z work-item IDs are packed into a single VGPR with the
    following layout. Only fields actually used by the function are set. The
    other bits are undefined.
13634
13635    The values come from the initial kernel execution state. See
13636    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13637
13638    .. table:: Work-item implicit argument layout
13639      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
13640
13641      ======= ======= ==============
13642      Bits    Size    Field Name
13643      ======= ======= ==============
13644      9:0     10 bits X Work-Item ID
13645      19:10   10 bits Y Work-Item ID
13646      29:20   10 bits Z Work-Item ID
13647      31:30   2 bits  Unused
13648      ======= ======= ==============
13649
136502.  Dispatch Ptr (2 SGPRs)
13651
13652    The value comes from the initial kernel execution state. See
13653    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13654
136553.  Queue Ptr (2 SGPRs)
13656
13657    The value comes from the initial kernel execution state. See
13658    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13659
136604.  Kernarg Segment Ptr (2 SGPRs)
13661
13662    The value comes from the initial kernel execution state. See
13663    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13664
5.  Dispatch ID (2 SGPRs)
13666
13667    The value comes from the initial kernel execution state. See
13668    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13669
136706.  Work-Group ID X (1 SGPR)
13671
13672    The value comes from the initial kernel execution state. See
13673    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13674
136757.  Work-Group ID Y (1 SGPR)
13676
13677    The value comes from the initial kernel execution state. See
13678    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13679
136808.  Work-Group ID Z (1 SGPR)
13681
13682    The value comes from the initial kernel execution state. See
13683    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13684
136859.  Implicit Argument Ptr (2 SGPRs)
13686
13687    The value is computed by adding an offset to Kernarg Segment Ptr to get the
13688    global address space pointer to the first kernarg implicit argument.
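
The work-item ID layout in item 1 can be illustrated with a short packing and
unpacking sketch. The function names are hypothetical. Because unused bits are
undefined, the unpacking helper masks each field rather than assuming the
upper bits are zero.

```python
# Pack and unpack the work-item ID layout from the table above:
# X in bits 9:0, Y in bits 19:10, Z in bits 29:20, bits 31:30 unused.
FIELD_BITS = 10
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack_workitem_id(x, y, z):
    """Pack three 10-bit work-item IDs into one 32-bit value."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << FIELD_BITS) | (z << (2 * FIELD_BITS))


def unpack_workitem_id(vgpr):
    """Extract (x, y, z), ignoring the unused bits 31:30."""
    x = vgpr & FIELD_MASK
    y = (vgpr >> FIELD_BITS) & FIELD_MASK
    z = (vgpr >> (2 * FIELD_BITS)) & FIELD_MASK
    return x, y, z
```

Note that a function which only uses the X ID may receive a value whose Y and Z
fields are undefined, so callers of such a helper should read only the fields
they know to be set.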
13689
13690The input and result arguments are assigned in order in the following manner:
13691
13692.. note::
13693
13694  There are likely some errors and omissions in the following description that
13695  need correction.
13696
13697  .. TODO::
13698
13699    Check the Clang source code to decipher how function arguments and return
13700    results are handled. Also see the AMDGPU specific values used.
13701
13702* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
13703  VGPR31.
13704
13705  If there are more arguments than will fit in these registers, the remaining
13706  arguments are allocated on the stack in order on naturally aligned
13707  addresses.
13708
13709  .. TODO::
13710
13711    How are overly aligned structures allocated on the stack?
13712
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR4 up to
  SGPR29.
13715
13716  If there are more arguments than will fit in these registers, the remaining
13717  arguments are allocated on the stack in order on naturally aligned
13718  addresses.
13719
13720Note that decomposed struct type arguments may have some fields passed in
13721registers and some in memory.
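
The VGPR assignment rule can be sketched as below. This is a simplified model
under stated assumptions: every argument is treated as a single 32-bit value,
overflow values get consecutive 4-byte stack slots, and the function name
``assign_vgpr_args`` is hypothetical. It does not model wider types, ``inreg``
arguments, or overly aligned structs.

```python
# Simplified model of VGPR argument assignment: VGPR0-31 first, then stack.
# Assumes each argument is one 32-bit value; wider types are not modeled.

def assign_vgpr_args(num_values, first_reg=0, last_reg=31, slot_bytes=4):
    """Return a location for each value: ("VGPR", n) or ("stack", offset)."""
    locations = []
    stack_offset = 0
    for index in range(num_values):
        reg = first_reg + index
        if reg <= last_reg:
            locations.append(("VGPR", reg))
        else:
            # Out of registers: allocate the next stack slot in order.
            locations.append(("stack", stack_offset))
            stack_offset += slot_bytes
    return locations
```

With 34 values, the first 32 land in VGPR0 through VGPR31 and the last two are
placed at stack offsets 0 and 4 relative to the entry swizzled SP.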
13722
13723.. TODO::
13724
13725  So, a struct which can pass some fields as decomposed register arguments, will
13726  pass the rest as decomposed stack elements? But an argument that will not start
13727  in registers will not be decomposed and will be passed as a non-decomposed
13728  stack value?
13729
13730The following is not part of the AMDGPU function calling convention but
13731describes how the AMDGPU implements function calls:
13732
137331.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
13734    unswizzled scratch address. It is only needed if runtime sized ``alloca``
13735    are used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.
13739
3.  Allocating SGPR arguments on the stack is not supported.
13741
137424.  No CFI is currently generated. See
13743    :ref:`amdgpu-dwarf-call-frame-information`.
13744
13745    .. note::
13746
13747      CFI will be generated that defines the CFA as the unswizzled address
13748      relative to the wave scratch base in the unswizzled private address space
13749      of the lowest address stack allocated local variable.
13750
13751      ``DW_AT_frame_base`` will be defined as the swizzled address in the
13752      swizzled private address space by dividing the CFA by the wavefront size
13753      (since CFA is always at least dword aligned which matches the scratch
13754      swizzle element size).
13755
13756      If no dynamic stack alignment was performed, the stack allocated arguments
13757      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
13758      local variables and register spill slots are accessed as positive offsets
13759      relative to ``DW_AT_frame_base``.
13760
137615.  Function argument passing is implemented by copying the input physical
13762    registers to virtual registers on entry. The register allocator can spill if
13763    necessary. These are copied back to physical registers at call sites. The
13764    net effect is that each function call can have these values in entirely
13765    distinct locations. The IPRA can help avoid shuffling argument registers.
137666.  Call sites are implemented by setting up the arguments at positive offsets
13767    from SP. Then SP is incremented to account for the known frame size before
13768    the call and decremented after the call.
13769
13770    .. note::
13771
13772      The CFI will reflect the changed calculation needed to compute the CFA
13773      from SP.
13774
7.  4-byte spill slots are used in the stack frame. One slot is allocated for an
    emergency spill slot. Buffer instructions are used for stack accesses
    rather than ``flat_scratch`` instructions.
13778
13779    .. TODO::
13780
13781      Explain when the emergency spill slot is used.
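
The SP adjustment around a call site (item 6) can be sketched as follows. The
sketch assumes that the frame size is known in per-lane (swizzled) bytes and
that it is scaled by the wavefront size before being applied to the unswizzled
SP in SGPR32. That scaling is an assumption drawn from the SP conversion
formula, and the function name is hypothetical.

```python
# Model of the SP adjustment around a call site.
# Assumption: the frame size is given in swizzled (per-lane) bytes and is
# scaled by the wavefront size because SGPR32 holds the unswizzled SP.

def call_site_sp(sp_unswizzled, frame_size_bytes, wavefront_size):
    """Return (SP seen by the callee, SP restored after the call)."""
    delta = frame_size_bytes * wavefront_size
    sp_in_callee = sp_unswizzled + delta  # incremented before the call
    sp_after_call = sp_in_callee - delta  # decremented after the call
    return sp_in_callee, sp_after_call
```

For a 16-byte frame with a wavefront size of 64, the unswizzled SP moves by
1024 before the call and returns to its original value afterwards.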
13782
13783.. TODO::
13784
13785  Possible broken issues:
13786
13787  - Stack arguments must be aligned to required alignment.
13788  - Stack is aligned to max(16, max formal argument alignment)
13789  - Direct argument < 64 bits should check register budget.
13790  - Register budget calculation should respect ``inreg`` for SGPR.
13791  - SGPR overflow is not handled.
13792  - struct with 1 member unpeeling is not checking size of member.
13793  - ``sret`` is after ``this`` pointer.
13794  - Caller is not implementing stack realignment: need an extra pointer.
13795  - Should say AMDGPU passes FP rather than SP.
13796  - Should CFI define CFA as address of locals or arguments. Difference is
13797    apparent when have implemented dynamic alignment.
13798  - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
13799    highest address of stack frame and use negative offset for locals. Would
13800    allow SP to be the same as FP and could support signal-handler-like as now
13801    have a real SP for the top of the stack.
13802  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
13803    arguments?
13804
13805AMDPAL
13806------
13807
13808This section provides code conventions used when the target triple OS is
13809``amdpal`` (see :ref:`amdgpu-target-triples`).
13810
13811.. _amdgpu-amdpal-code-object-metadata-section:
13812
13813Code Object Metadata
13814~~~~~~~~~~~~~~~~~~~~
13815
13816.. note::
13817
13818  The metadata is currently in development and is subject to major
13819  changes. Only the current version is supported. *When this document
13820  was generated the version was 2.6.*
13821
13822Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
13823record (see :ref:`amdgpu-note-records-v3-onwards`).
13824
13825The metadata is represented as Message Pack formatted binary data (see
13826[MsgPack]_). The top level is a Message Pack map that includes the keys
13827defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
13828and referenced tables.
13829
13830Additional information can be added to the maps. To avoid conflicts, any
13831key names should be prefixed by "*vendor-name*." where ``vendor-name``
13832can be the name of the vendor and specific vendor tool that generates the
13833information. The prefix is abbreviated to simply "." when it appears
13834within a map that has been added by the same *vendor-name*.
13835
13836  .. table:: AMDPAL Code Object Metadata Map
13837     :name: amdgpu-amdpal-code-object-metadata-map-table
13838
13839     =================== ============== ========= ======================================================================
13840     String Key          Value Type     Required? Description
13841     =================== ============== ========= ======================================================================
13842     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
13843                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
13844     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
13845                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
13846                                                  definition of the keys included in that map.
13847     =================== ============== ========= ======================================================================
13848
13849..
13850
13851  .. table:: AMDPAL Code Object Pipeline Metadata Map
13852     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
13853
13854     ====================================== ============== ========= ===================================================
13855     String Key                             Value Type     Required? Description
13856     ====================================== ============== ========= ===================================================
13857     ".name"                                string                   Source name of the pipeline.
13858     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
13859
13860                                                                       - "VsPs"
13861                                                                       - "Gs"
13862                                                                       - "Cs"
13863                                                                       - "Ngg"
13864                                                                       - "Tess"
13865                                                                       - "GsTess"
13866                                                                       - "NggTess"
13867
13868     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
13869                                            2 integers               64 bits is the "stable" portion of the hash, used
13870                                                                     for e.g. shader replacement lookup. Upper 64 bits
13871                                                                     is the "unique" portion of the hash, used for
13872                                                                     e.g. pipeline cache lookup. The value is
13873                                                                     implementation defined, and can not be relied on
13874                                                                     between different builds of the compiler.
13875     ".shaders"                             map                      Per-API shader metadata. See
13876                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
13877                                                                     for the definition of the keys included in that
13878                                                                     map.
13879     ".hardware_stages"                     map                      Per-hardware stage metadata. See
13880                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
13881                                                                     for the definition of the keys included in that
13882                                                                     map.
13883     ".shader_functions"                    map                      Per-shader function metadata. See
13884                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
13885                                                                     for the definition of the keys included in that
13886                                                                     map.
13887     ".registers"                           map            Required  Hardware register configuration. See
13888                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
13889                                                                     for the definition of the keys included in that
13890                                                                     map.
13891     ".user_data_limit"                     integer                  Number of user data entries accessed by this
13892                                                                     pipeline.
13893     ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
13894                                                                     NoUserDataSpilling.
13895     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
13896                                                                     viewport array index feature. Pipelines which use
13897                                                                     this feature can render into all 16 viewports,
13898                                                                     whereas pipelines which do not use it are
13899                                                                     restricted to viewport #0.
13900     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
13901                                                                     handling data-passing between the ES and GS
13902                                                                     shader stages. This can be zero if the data is
13903                                                                     passed using off-chip buffers. This value should
13904                                                                     be used to program all user-SGPRs which have been
13905                                                                     marked with "UserDataMapping::EsGsLdsSize"
13906                                                                     (typically only the GS and VS HW stages will ever
13907                                                                     have a user-SGPR so marked).
13908     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
13909                                                                     (maximum number of threads in a subgroup).
13910     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
13911     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
13912     ".api"                                 string                   Name of the client graphics API.
13913     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
13914                                                                     be defined by the driver using the compiler if
13915                                                                     they want to be able to correlate API-specific
13916                                                                     information used during creation at a later time.
13917     ====================================== ============== ========= ===================================================
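
As an illustration of the two maps above, the following sketch builds the
metadata as a plain Python dictionary before any Message Pack encoding and
checks for the keys marked Required. The pipeline name, hash values and empty
register map are placeholders, and ``missing_required_keys`` is a hypothetical
helper, not part of PAL or LLVM.

```python
# Model of the AMDPAL metadata maps as plain Python data.
# Only the key names and the Required flags come from the tables above.
REQUIRED_TOP_LEVEL = ("amdpal.version", "amdpal.pipelines")
REQUIRED_PIPELINE = (".internal_pipeline_hash", ".registers")


def missing_required_keys(metadata):
    """List required keys that are absent from the metadata map."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
    for index, pipeline in enumerate(metadata.get("amdpal.pipelines", [])):
        missing += [
            f"amdpal.pipelines[{index}]{key}"
            for key in REQUIRED_PIPELINE
            if key not in pipeline
        ]
    return missing


example_metadata = {
    "amdpal.version": [2, 6],  # (major, minor), per the note above
    "amdpal.pipelines": [
        {
            ".name": "example_pipeline",        # placeholder name
            ".type": "Cs",
            ".internal_pipeline_hash": [0, 0],  # placeholder hash values
            ".registers": {},                   # placeholder register map
        }
    ],
}
```

Calling ``missing_required_keys(example_metadata)`` returns an empty list.
Removing ``".registers"`` from the pipeline map would report it as missing.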
13918
13919..
13920
13921  .. table:: AMDPAL Code Object Shader Map
13922     :name: amdgpu-amdpal-code-object-shader-map-table
13923
13924
13925     +-------------+--------------+-------------------------------------------------------------------+
13926     |String Key   |Value Type    |Description                                                        |
13927     +=============+==============+===================================================================+
13928     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
13929     |- ".vertex"  |              |for the definition of the keys included in that map.               |
13930     |- ".hull"    |              |                                                                   |
13931     |- ".domain"  |              |                                                                   |
13932     |- ".geometry"|              |                                                                   |
13933     |- ".pixel"   |              |                                                                   |
13934     +-------------+--------------+-------------------------------------------------------------------+
13935
13936..
13937
13938  .. table:: AMDPAL Code Object API Shader Metadata Map
13939     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
13940
13941     ==================== ============== ========= =====================================================================
13942     String Key           Value Type     Required? Description
13943     ==================== ============== ========= =====================================================================
13944     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
13945                          2 integers               is implementation defined, and can not be relied on between
13946                                                   different builds of the compiler.
13947     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
13948                          string                   include:
13949
13950                                                     - ".ls"
13951                                                     - ".hs"
13952                                                     - ".es"
13953                                                     - ".gs"
13954                                                     - ".vs"
13955                                                     - ".ps"
13956                                                     - ".cs"
13957
13958     ==================== ============== ========= =====================================================================
13959
13960..
13961
13962  .. table:: AMDPAL Code Object Hardware Stage Map
13963     :name: amdgpu-amdpal-code-object-hardware-stage-map-table
13964
13965     +-------------+--------------+-----------------------------------------------------------------------+
13966     |String Key   |Value Type    |Description                                                            |
13967     +=============+==============+=======================================================================+
13968     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
13969     |- ".hs"      |              |for the definition of the keys included in that map.                   |
13970     |- ".es"      |              |                                                                       |
13971     |- ".gs"      |              |                                                                       |
13972     |- ".vs"      |              |                                                                       |
13973     |- ".ps"      |              |                                                                       |
13974     |- ".cs"      |              |                                                                       |
13975     +-------------+--------------+-----------------------------------------------------------------------+
13976
13977..
13978
13979  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
13980     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
13981
13982     ========================== ============== ========= ===============================================================
13983     String Key                 Value Type     Required? Description
13984     ========================== ============== ========= ===============================================================
13985     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
13986     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
13987     ".lds_size"                integer                  Local Data Share size in bytes.
13988     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
13989     ".vgpr_count"              integer                  Number of VGPRs used.
13990     ".agpr_count"              integer                  Number of AGPRs used.
13991     ".sgpr_count"              integer                  Number of SGPRs used.
13992     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
13993                                                         directive to instruct the compiler to limit the VGPR usage to
13994                                                         be less than or equal to the specified value (only set if
13995                                                         different from HW default).
13996     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
13997                                                         default).
13998     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
13999                                3 integers
14000     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
14001     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
14002     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
14003     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
14004     ".writes_depth"            boolean                  The shader writes out a depth value.
14005     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
14006                                                         memory or GDS.
14007     ".uses_prim_id"            boolean                  The shader uses PrimID.
14008     ========================== ============== ========= ===============================================================
14009
14010..
14011
14012  .. table:: AMDPAL Code Object Shader Function Map
14013     :name: amdgpu-amdpal-code-object-shader-function-map-table
14014
14015     =============== ============== ====================================================================
14016     String Key      Value Type     Description
14017     =============== ============== ====================================================================
14018     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
14019                                    entry address. The value is the function's metadata. See
14020                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14021     =============== ============== ====================================================================
14022
14023..
14024
14025  .. table:: AMDPAL Code Object Shader Function Metadata Map
14026     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14027
14028     ============================= ============== =================================================================
14029     String Key                    Value Type     Description
14030     ============================= ============== =================================================================
14031     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
14032                                   2 integers     is implementation defined, and can not be relied on between
14033                                                  different builds of the compiler.
14034     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
14035     ".lds_size"                   integer        Size in bytes of LDS memory.
14036     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
14037     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
14038     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
14039     ".shader_subtype"             string         Shader subtype/kind. Values include:
14040
14041                                                    - "Unknown"
14042
14043     ============================= ============== =================================================================
14044
14045..
14046
14047  .. table:: AMDPAL Code Object Register Map
14048     :name: amdgpu-amdpal-code-object-register-map-table
14049
14050     ========================== ============== ====================================================================
14051     32-bit Integer Key         Value Type     Description
14052     ========================== ============== ====================================================================
14053     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14054                                               a GRBM register (i.e., driver accessible GPU register number, not
14055                                               shader GPR register number). The driver is required to program each
14056                                               specified register to the corresponding specified value when
14057                                               executing this pipeline. Typically, the ``reg offsets`` are the
14058                                               ``uint16_t`` offsets to each register as defined by the hardware
14059                                               chip headers. The register is set to the provided value. However, a
14060                                               ``reg offset`` that specifies a user data register (e.g.,
14061                                               COMPUTE_USER_DATA_0) needs special treatment. See
14062                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14063                                               information.
14064     ========================== ============== ====================================================================
14065
14066.. _amdgpu-amdpal-code-object-user-data-section:
14067
14068User Data
14069+++++++++
14070
14071Each hardware stage has a set of 32-bit physical SPI *user data registers*
14072(either 16 or 32 based on graphics IP and the stage) which can be
14073written from a command buffer and then loaded into SGPRs when waves are
14074launched via a subsequent dispatch or draw operation. This is the way
14075most arguments are passed from the application/runtime to a hardware
14076shader.
14077
14078PAL abstracts this functionality by exposing a set of 128 *user data
14079entries* per pipeline a client can use to pass arguments from a command
14080buffer to one or more shaders in that pipeline. The ELF code object must
14081specify a mapping from virtualized *user data entries* to physical *user
14082data registers*, and PAL is responsible for implementing that mapping,
14083including spilling overflow *user data entries* to memory if needed.
14084
14085Since the *user data registers* are GRBM-accessible SPI registers, this
14086mapping is actually embedded in the ``.registers`` metadata entry. For
14087most registers, the value in that map is a literal 32-bit value that
14088should be written to the register by the driver. However, when the
14089register is a *user data register* (any USER_DATA register e.g.,
14090SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14091the driver to write either a *user data entry* value or one of several
14092driver-internal values to the register. This encoding is described in
14093the following table:
14094
14095.. note::
14096
  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0
  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14099  always be programmed to the address of the GlobalTable, and *user data
14100  register* 1 must always be programmed to the address of the PerShaderTable.
14101
14102..
14103
14104  .. table:: AMDPAL User Data Mapping
14105     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14106
14107     ==========  =================  ===============================================================================
14108     Value       Name               Description
14109     ==========  =================  ===============================================================================
14110     0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14111     0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
14112                                    always point to *user data register* 0).
14113     0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
14114                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14115                                    for more detail (should always point to *user data register* 1).
14116     0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
14117                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14118                                    more detail.
14119     0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14120                                    reference the draw index in the vertex shader. Only supported by the first
14121                                    stage in a graphics pipeline.
14122     0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
14123                                    a graphics pipeline.
14124     0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
14125                                    graphics pipeline.
14126     0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14127                                    a buffer containing the grid dimensions for a Compute dispatch operation. The
14128                                    high half of the address is stored in the next sequential user-SGPR. Only
14129                                    supported by compute pipelines.
14130     0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
14131                                    space used for the ES/GS pseudo-ring-buffer for passing data between shader
14132                                    stages.
14133     0x1000000B  ViewId             View id (32-bit unsigned integer) identifies a view of graphic
14134                                    pipeline instancing.
14135     0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
14136                                    can only appear for one shader stage per pipeline.
14137     0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
14138     0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
14139                                    only appear for one shader stage per pipeline.
14140     0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
14141                                    only appear for one shader stage per pipeline (PS). These replace color targets
14142                                    and are completely separate from any UAVs used by the shader. This is optional,
14143                                    and only used by the PS when UAV exports are used to replace color-target
14144                                    exports to optimize specific shaders.
14145     0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
14146                                    some NGG pipelines to perform culling.  This value contains the address of the
14147                                    first of two consecutive registers which provide the full GPU address.
14148     0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
14149     ==========  =================  ===============================================================================
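
As an illustration of the encoding in the table above, the following minimal
Python sketch classifies a value read from the ``.registers`` map for a *user
data register*. The helper name ``describe_user_data_value`` is hypothetical and
is not part of PAL or LLVM; only the numeric values and names come from the
table.

.. code-block:: python

  # Special values copied from the AMDPAL User Data Mapping table.
  USER_DATA_MAPPING = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }


  def describe_user_data_value(value):
      """Hypothetical helper: classify a user data register value."""
      if 0 <= value <= 127:
          # Values 0..127 select one of the 128 user data entries.
          return f"user_data_entry[{value}]"
      name = USER_DATA_MAPPING.get(value)
      if name is None:
          raise ValueError(f"unrecognized user data mapping value {value:#x}")
      return name

For example, ``describe_user_data_value(0x10000002)`` returns ``"SpillTable"``,
while a value not listed in the table raises ``ValueError``.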
14150
14151.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14152
14153Per-Shader Table
14154################
14155
14156Low 32 bits of the GPU address for an optional buffer in the ``.data``
14157section of the ELF. The high 32 bits of the address match the high 32 bits
14158of the shader's program counter.
14159
14160The buffer can be anything the shader compiler needs it for, and
14161allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
14164well. Its layout and usage are defined by the shader compiler.
14165
14166Each shader's table in the ``.data`` section is referenced by the symbol
14167``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``  where *xs* corresponds with the
14168hardware shader stage the data is for. E.g.,
14169``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
14170
14171.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14172
14173Spill Table
14174###########
14175
14176It is possible for a hardware shader to need access to more *user data
14177entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory and one
user data register to be used to point to the spilled user data memory. The
14181value of the *user data entry* must then represent the location where
14182a shader expects to read the low 32-bits of the table's GPU virtual
14183address. The *spill table* itself represents a set of 32-bit values
14184managed by the PAL runtime in GPU-accessible memory that can be made
14185indirectly accessible to a hardware shader.
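
The following Python sketch shows one way a compiler could lay out such a
mapping. It is an illustration under stated assumptions, not the policy used by
PAL or LLVM: the function ``map_user_data`` and its "keep the last register for
the spill table pointer" policy are hypothetical. Only the reserved registers 0
and 1 and the three special values come from the text and table above.

.. code-block:: python

  GLOBAL_TABLE = 0x10000000
  PER_SHADER_TABLE = 0x10000001
  SPILL_TABLE = 0x10000002


  def map_user_data(entries, num_registers):
      """Hypothetical policy: return (register_map, spilled_entries).

      register_map maps a user data register index to either a user data
      entry index (0..127) or one of the special values above.
      """
      if num_registers < 3:
          raise ValueError("need the two reserved registers plus one more")
      # Registers 0 and 1 are reserved for the internal tables.
      register_map = {0: GLOBAL_TABLE, 1: PER_SHADER_TABLE}
      free = num_registers - 2
      if len(entries) <= free:
          direct, spilled = list(entries), []
      else:
          # Hold one register back for the spill table pointer.
          direct, spilled = list(entries[:free - 1]), list(entries[free - 1:])
      for i, entry in enumerate(direct):
          register_map[2 + i] = entry
      if spilled:
          register_map[2 + len(direct)] = SPILL_TABLE
      return register_map, spilled

With 16 registers and 20 entries, this sketch maps entries 0 to 12 directly,
places the ``SpillTable`` value in register 15, and reports entries 13 to 19 as
spilled.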
14186
14187Unspecified OS
14188--------------
14189
14190This section provides code conventions used when the target triple OS is
14191empty (see :ref:`amdgpu-target-triples`).
14192
14193Trap Handler ABI
14194~~~~~~~~~~~~~~~~
14195
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
14199
14200  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14201     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14202
14203     =============== =============== ===========================================
14204     Usage           Code Sequence   Description
14205     =============== =============== ===========================================
14206     llvm.trap       s_endpgm        Causes wavefront to be terminated.
14207     llvm.debugtrap  *none*          Compiler warning given that there is no
14208                                     trap handler installed.
14209     =============== =============== ===========================================
14210
14211Source Languages
14212================
14213
14214.. _amdgpu-opencl:
14215
14216OpenCL
14217------
14218
When the language is OpenCL, the following differences occur:
14220
142211. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
142222. The AMDGPU backend appends additional arguments to the kernel's explicit
14223   arguments for the AMDHSA OS (see
14224   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
142253. Additional metadata is generated
14226   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14227
14228  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14229     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14230
14231     ======== ==== ========= ===========================================
14232     Position Byte Byte      Description
14233              Size Alignment
14234     ======== ==== ========= ===========================================
14235     1        8    8         OpenCL Global Offset X
14236     2        8    8         OpenCL Global Offset Y
14237     3        8    8         OpenCL Global Offset Z
14238     4        8    8         OpenCL address of printf buffer
14239     5        8    8         OpenCL address of virtual queue used by
14240                             enqueue_kernel.
14241     6        8    8         OpenCL address of AqlWrap struct used by
14242                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
14244                             synchronization.
14245     ======== ==== ========= ===========================================
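
The sizes and alignments in the table above determine where each implicit
argument lands once the size of the explicit arguments is known. The Python
sketch below computes those byte offsets. The argument labels and the function
``implicit_arg_offsets`` are hypothetical, and the sketch assumes the implicit
arguments are placed immediately after the explicit ones, each aligned to its
listed alignment.

.. code-block:: python

  # (label, byte size, byte alignment), in table order. Labels are
  # illustrative only; sizes and alignments come from the table.
  IMPLICIT_ARGS = [
      ("global_offset_x", 8, 8),
      ("global_offset_y", 8, 8),
      ("global_offset_z", 8, 8),
      ("printf_buffer", 8, 8),
      ("virtual_queue", 8, 8),
      ("aql_wrap", 8, 8),
      ("multigrid_sync_arg", 8, 8),
  ]


  def align_up(offset, alignment):
      """Round offset up to the next multiple of alignment."""
      return (offset + alignment - 1) // alignment * alignment


  def implicit_arg_offsets(explicit_args_size):
      """Assumed layout: implicit arguments follow the explicit ones."""
      offset = explicit_args_size
      offsets = {}
      for label, size, alignment in IMPLICIT_ARGS:
          offset = align_up(offset, alignment)
          offsets[label] = offset
          offset += size
      return offsets

Under this assumption, a kernel with 20 bytes of explicit arguments would have
its first implicit argument at byte offset 24 and its seventh at byte offset 72.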
14246
14247.. _amdgpu-hcc:
14248
14249HCC
14250---
14251
When the language is HCC, the following differences occur:
14253
142541. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14255
14256.. _amdgpu-assembler:
14257
14258Assembler
14259---------
14260
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX11.
14263
14264This section describes general syntax for instructions and operands.
14265
14266Instructions
14267~~~~~~~~~~~~
14268
14269An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14270
14271  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14272    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14273
14274:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14275:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14276
14277The order of operands and modifiers is fixed.
14278Most modifiers are optional and may be omitted.
14279
Links to the detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included
in these descriptions.
14283
14284    ============= ============================================= =======================================
14285    Architecture  Core ISA                                      ISA Variants and Extensions
14286    ============= ============================================= =======================================
14287    GCN 2         :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`             \-
14288    GCN 3, GCN 4  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`             \-
14289    GCN 5         :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14290
14291                                                                :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14292
14293                                                                :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14294
14295                                                                :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14296
14297                                                                :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14298
14299                                                                :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14300
14301    CDNA 1        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14302
14303    CDNA 2        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14304
14305    CDNA 3        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14306
14307    RDNA 1        :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>`     :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14308
14309                                                                :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14310
14311                                                                :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14312
14313                                                                :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14314
14315    RDNA 2        :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>`   :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14316
14317                                                                :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14318
14319                                                                :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14320
14321                                                                :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14322
14323                                                                :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14324
14325                                                                :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14326
14327                                                                :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14328    ============= ============================================= =======================================
14329
For more information about instructions, their semantics, and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14333[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14334[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_ and
14335[AMD-GCN-GFX10-RDNA2]_.
14336
14337Operands
14338~~~~~~~~
14339
A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14341
14342Modifiers
14343~~~~~~~~~
14344
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
14347
14348Instruction Examples
14349~~~~~~~~~~~~~~~~~~~~
14350
14351DS
14352++
14353
14354.. code-block:: nasm
14355
14356  ds_add_u32 v2, v4 offset:16
14357  ds_write_src2_b64 v2 offset0:4 offset1:8
14358  ds_cmpst_f32 v2, v4, v6
14359  ds_min_rtn_f64 v[8:9], v2, v[4:5]
14360
For the full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.
14363
14364FLAT
14365++++
14366
14367.. code-block:: nasm
14368
14369  flat_load_dword v1, v[3:4]
14370  flat_store_dwordx3 v[3:4], v[5:7]
14371  flat_atomic_swap v1, v[3:4], v5 glc
14372  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14373  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14374
For the full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.
14377
14378MUBUF
14379+++++
14380
14381.. code-block:: nasm
14382
14383  buffer_load_dword v1, off, s[4:7], s1
14384  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14385  buffer_store_format_xy v[1:2], off, s[4:7], s1
14386  buffer_wbinvl1
14387  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14388
For the full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.
14391
14392SMRD/SMEM
14393+++++++++
14394
14395.. code-block:: nasm
14396
14397  s_load_dword s1, s[2:3], 0xfc
14398  s_load_dwordx8 s[8:15], s[2:3], s4
14399  s_load_dwordx16 s[88:103], s[2:3], s4
14400  s_dcache_inv_vol
14401  s_memtime s[4:5]
14402
For the full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.
14405
14406SOP1
14407++++
14408
14409.. code-block:: nasm
14410
14411  s_mov_b32 s1, s2
14412  s_mov_b64 s[0:1], 0x80000000
14413  s_cmov_b32 s1, 200
14414  s_wqm_b64 s[2:3], s[4:5]
14415  s_bcnt0_i32_b64 s1, s[2:3]
14416  s_swappc_b64 s[2:3], s[4:5]
14417  s_cbranch_join s[4:5]
14418
For the full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.
14421
14422SOP2
14423++++
14424
14425.. code-block:: nasm
14426
14427  s_add_u32 s1, s2, s3
14428  s_and_b64 s[2:3], s[4:5], s[6:7]
14429  s_cselect_b32 s1, s2, s3
14430  s_andn2_b32 s2, s4, s6
14431  s_lshr_b64 s[2:3], s[4:5], s6
14432  s_ashr_i32 s2, s4, s6
14433  s_bfm_b64 s[2:3], s4, s6
14434  s_bfe_i64 s[2:3], s[4:5], s6
14435  s_cbranch_g_fork s[4:5], s[6:7]
14436
For the full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.
14439
14440SOPC
14441++++
14442
14443.. code-block:: nasm
14444
14445  s_cmp_eq_i32 s1, s2
14446  s_bitcmp1_b32 s1, s2
14447  s_bitcmp0_b64 s[2:3], s4
14448  s_setvskip s3, s5
14449
For the full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.
14452
14453SOPP
14454++++
14455
14456.. code-block:: nasm
14457
14458  s_barrier
14459  s_nop 2
14460  s_endpgm
14461  s_waitcnt 0 ; Wait for all counters to be 0
14462  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
14463  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
14464  s_sethalt 9
14465  s_sleep 10
14466  s_sendmsg 0x1
14467  s_sendmsg sendmsg(MSG_INTERRUPT)
14468  s_trap 1
14469
For the full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.
14472
14473Unless otherwise mentioned, little verification is performed on the operands
14474of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
14476
14477VALU
14478++++
14479
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the operands. To
force a specific encoding, add a suffix to the opcode of the instruction:
14483
14484* _e32 for 32-bit VOP1/VOP2/VOPC
14485* _e64 for 64-bit VOP3
14486* _dpp for VOP_DPP
14487* _sdwa for VOP_SDWA
14488
14489VOP1/VOP2/VOP3/VOPC examples:
14490
14491.. code-block:: nasm
14492
14493  v_mov_b32 v1, v2
14494  v_mov_b32_e32 v1, v2
14495  v_nop
14496  v_cvt_f64_i32_e32 v[1:2], v2
14497  v_floor_f32_e32 v1, v2
14498  v_bfrev_b32_e32 v1, v2
14499  v_add_f32_e32 v1, v2, v3
14500  v_mul_i32_i24_e64 v1, v2, 3
14501  v_mul_i32_i24_e32 v1, -3, v3
14502  v_mul_i32_i24_e32 v1, -100, v3
14503  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
14504  v_max_f16_e32 v1, v2, v3
14505
14506VOP_DPP examples:
14507
14508.. code-block:: nasm
14509
14510  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
14511  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14512  v_mov_b32 v0, v0 wave_shl:1
14513  v_mov_b32 v0, v0 row_mirror
14514  v_mov_b32 v0, v0 row_bcast:31
14515  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
14516  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14517  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14518
14519VOP_SDWA examples:
14520
14521.. code-block:: nasm
14522
14523  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14524  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14525  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14526  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14527  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14528
For the full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.
14530
14531.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14532
14533Code Object V2 Predefined Symbols
14534~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14535
14536.. warning::
14537  Code object V2 is not the default code object version emitted by
14538  this version of LLVM.
14539
14540The AMDGPU assembler defines and updates some symbols automatically. These
14541symbols do not affect code generation.
14542
14543.option.machine_version_major
14544+++++++++++++++++++++++++++++
14545
14546Set to the GFX major generation number of the target being assembled for. For
14547example, when assembling for a "GFX9" target this will be set to the integer
14548value "9". The possible GFX major generation numbers are presented in
14549:ref:`amdgpu-processors`.
14550
14551.option.machine_version_minor
14552+++++++++++++++++++++++++++++
14553
14554Set to the GFX minor generation number of the target being assembled for. For
14555example, when assembling for a "GFX810" target this will be set to the integer
14556value "1". The possible GFX minor generation numbers are presented in
14557:ref:`amdgpu-processors`.
14558
14559.option.machine_version_stepping
14560++++++++++++++++++++++++++++++++
14561
14562Set to the GFX stepping generation number of the target being assembled for.
14563For example, when assembling for a "GFX704" target this will be set to the
14564integer value "4". The possible GFX stepping generation numbers are presented
14565in :ref:`amdgpu-processors`.
14566
14567.kernel.vgpr_count
14568++++++++++++++++++
14569
14570Set to zero each time a
14571:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14572encountered. At each instruction, if the current value of this symbol is less
14573than or equal to the maximum VGPR number explicitly referenced within that
14574instruction then the symbol value is updated to equal that VGPR number plus
14575one.
14576
14577.kernel.sgpr_count
14578++++++++++++++++++
14579
14580Set to zero each time a
14581:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14582encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
14584instruction then the symbol value is updated to equal that SGPR number plus
14585one.
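
The update rule for ``.kernel.vgpr_count`` and ``.kernel.sgpr_count`` can be
traced by hand. The following sketch uses a hypothetical kernel name,
``my_kernel``, and arbitrary registers. The comments show the symbol values
that the rule described above would be expected to produce; they are
annotations, not assembler output:

.. code::

   .amdgpu_hsa_kernel my_kernel         // both symbols are reset to 0
   v_mov_b32 v3, v0                     // highest VGPR is v3: .kernel.vgpr_count = 4
   v_mov_b32 v1, v2                     // highest VGPR is v2: .kernel.vgpr_count stays 4
   s_mov_b32 s5, s0                     // highest SGPR is s5: .kernel.sgpr_count = 6
   s_load_dwordx2 s[6:7], s[0:1], 0x0   // highest SGPR is s7: .kernel.sgpr_count = 8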
14586
14587.. _amdgpu-amdhsa-assembler-directives-v2:
14588
14589Code Object V2 Directives
14590~~~~~~~~~~~~~~~~~~~~~~~~~
14591
14592.. warning::
14593  Code object V2 is not the default code object version emitted by
14594  this version of LLVM.
14595
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
14598
14599.hsa_code_object_version major, minor
14600+++++++++++++++++++++++++++++++++++++
14601
14602*major* and *minor* are integers that specify the version of the HSA code
14603object that will be generated by the assembler.
14604
14605.hsa_code_object_isa [major, minor, stepping, vendor, arch]
14606+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14607
14608
14609*major*, *minor*, and *stepping* are all integers that describe the instruction
14610set architecture (ISA) version of the assembly program.
14611
14612*vendor* and *arch* are quoted strings. *vendor* should always be equal to
14613"AMD" and *arch* should always be equal to "AMDGPU".
14614
14615By default, the assembler will derive the ISA version, *vendor*, and *arch*
14616from the value of the -mcpu option that is passed to the assembler.
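
As a sketch of the two forms, the directive can be written either without
arguments or with all five. The ISA version 7.0.0 below is chosen purely for
illustration, and the two lines are alternatives rather than a sequence to be
used together:

.. code::

   // Form 1: ISA version, vendor, and arch are derived from -mcpu
   .hsa_code_object_isa

   // Form 2: everything is spelled out explicitly
   .hsa_code_object_isa 7,0,0,"AMD","AMDGPU"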
14617
14618.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
14619
14620.amdgpu_hsa_kernel (name)
14621+++++++++++++++++++++++++
14622
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
14626
14627.amd_kernel_code_t
14628++++++++++++++++++
14629
14630This directive marks the beginning of a list of key / value pairs that are used
14631to specify the amd_kernel_code_t object that will be emitted by the assembler.
14632The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
14633amd_kernel_code_t values that are unspecified a default value will be used. The
14634default value for all keys is 0, with the following exceptions:
14635
- *amd_kernel_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
14640  *amd_machine_version_stepping* are derived from the value of the -mcpu option
14641  that is passed to the assembler.
14642- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards, it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5.
14645  Note that wavefront size is specified as a power of two, so a value of **n**
14646  means a size of 2^ **n**.
14647- *call_convention* defaults to -1.
14648- *kernarg_segment_alignment*, *group_segment_alignment*, and
14649  *private_segment_alignment* default to 4. Note that alignments are specified
14650  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
14651- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
14652  GFX90A onwards.
14653- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
14654  GFX10 onwards.
14655- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
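
Because the wavefront size and the alignments are encoded as powers of two,
it can help to annotate the decoded values. The following fragment is a sketch
that simply restates the defaults listed above for a target before GFX10; the
comments give the decoded sizes:

.. code::

   .amd_kernel_code_t
      wavefront_size = 6              // 2^6 = 64 work-items per wavefront
      kernarg_segment_alignment = 4   // 2^4 = 16-byte alignment
      group_segment_alignment = 4     // 2^4 = 16-byte alignment
      private_segment_alignment = 4   // 2^4 = 16-byte alignment
   .end_amd_kernel_code_t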
14656
14657The *.amd_kernel_code_t* directive must be placed immediately after the
14658function label and before any instructions.
14659
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.
14662
14663.. _amdgpu-amdhsa-assembler-example-v2:
14664
14665Code Object V2 Example Source Code
14666~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14667
14668.. warning::
14669  Code Object V2 is not the default code object version emitted by
14670  this version of LLVM.
14671
14672Here is an example of a minimal assembly source file, defining one HSA kernel:
14673
14674.. code::
14675   :number-lines:
14676
14677   .hsa_code_object_version 1,0
14678   .hsa_code_object_isa
14679
14680   .hsatext
14681   .globl  hello_world
14682   .p2align 8
14683   .amdgpu_hsa_kernel hello_world
14684
14685   hello_world:
14686
     .amd_kernel_code_t
        enable_sgpr_kernarg_segment_ptr = 1
        is_ptr64 = 1
        compute_pgm_rsrc1_vgprs = 0
        compute_pgm_rsrc1_sgprs = 0
        compute_pgm_rsrc2_user_sgpr = 2
        compute_pgm_rsrc1_wgp_mode = 0
        compute_pgm_rsrc1_mem_ordered = 0
        compute_pgm_rsrc1_fwd_progress = 1
     .end_amd_kernel_code_t
14697
14698     s_load_dwordx2 s[0:1], s[0:1] 0x0
14699     v_mov_b32 v0, 3.14159
14700     s_waitcnt lgkmcnt(0)
14701     v_mov_b32 v1, s0
14702     v_mov_b32 v2, s1
14703     flat_store_dword v[1:2], v0
14704     s_endpgm
14705   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world
14707
14708.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
14709
14710Code Object V3 and Above Predefined Symbols
14711~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14712
14713The AMDGPU assembler defines and updates some symbols automatically. These
14714symbols do not affect code generation.
14715
14716.amdgcn.gfx_generation_number
14717+++++++++++++++++++++++++++++
14718
14719Set to the GFX major generation number of the target being assembled for. For
14720example, when assembling for a "GFX9" target this will be set to the integer
14721value "9". The possible GFX major generation numbers are presented in
14722:ref:`amdgpu-processors`.
14723
14724.amdgcn.gfx_generation_minor
14725++++++++++++++++++++++++++++
14726
14727Set to the GFX minor generation number of the target being assembled for. For
14728example, when assembling for a "GFX810" target this will be set to the integer
14729value "1". The possible GFX minor generation numbers are presented in
14730:ref:`amdgpu-processors`.
14731
14732.amdgcn.gfx_generation_stepping
14733+++++++++++++++++++++++++++++++
14734
14735Set to the GFX stepping generation number of the target being assembled for.
14736For example, when assembling for a "GFX704" target this will be set to the
14737integer value "4". The possible GFX stepping generation numbers are presented
14738in :ref:`amdgpu-processors`.
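
As a sketch of how the three version symbols relate to a processor name,
consider assembling with ``-mcpu=gfx906``. The example assumes the symbols can
be used in ordinary absolute expressions, here as operands of the generic
``.byte`` directive; the comments show the values that would be expected:

.. code::

   .byte .amdgcn.gfx_generation_number     // expected to emit 9
   .byte .amdgcn.gfx_generation_minor      // expected to emit 0
   .byte .amdgcn.gfx_generation_stepping   // expected to emit 6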
14739
14740.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
14741
14742.amdgcn.next_free_vgpr
14743++++++++++++++++++++++
14744
14745Set to zero before assembly begins. At each instruction, if the current value
14746of this symbol is less than or equal to the maximum VGPR number explicitly
14747referenced within that instruction then the symbol value is updated to equal
14748that VGPR number plus one.
14749
May be used to set the ``.amdhsa_next_free_vgpr`` directive in
14751:ref:`amdhsa-kernel-directives-table`.
14752
14753May be set at any time, e.g. manually set to zero at the start of each kernel.
14754
14755.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
14756
14757.amdgcn.next_free_sgpr
14758++++++++++++++++++++++
14759
14760Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
14762referenced within that instruction then the symbol value is updated to equal
14763that SGPR number plus one.
14764
May be used to set the ``.amdhsa_next_free_sgpr`` directive in
14766:ref:`amdhsa-kernel-directives-table`.
14767
14768May be set at any time, e.g. manually set to zero at the start of each kernel.
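
The behavior of both tracking symbols, including a manual reset, can be traced
by hand. In this sketch the registers are arbitrary and the symbols are assumed
to still hold their initial value of zero at the first instruction. The
comments are annotations showing the values the update rule would be expected
to produce:

.. code::

   v_mov_b32 v7, v0                 // .amdgcn.next_free_vgpr = 8
   s_mov_b32 s11, s2                // .amdgcn.next_free_sgpr = 12
   v_mov_b32 v2, v1                 // .amdgcn.next_free_vgpr stays 8

   .set .amdgcn.next_free_vgpr, 0   // manual reset, for example before the next kernel
   .set .amdgcn.next_free_sgpr, 0

   v_mov_b32 v1, v0                 // .amdgcn.next_free_vgpr = 2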
14769
14770.. _amdgpu-amdhsa-assembler-directives-v3-onwards:
14771
14772Code Object V3 and Above Directives
14773~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14774
14775Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
14776architecture processors, and are not OS-specific. Directives which begin with
14777``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
14778``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
14779:ref:`amdgpu-processors`.
14780
14781.. _amdgpu-assembler-directive-amdgcn-target:
14782
14783.amdgcn_target <target-triple> "-" <target-id>
14784++++++++++++++++++++++++++++++++++++++++++++++
14785
14786Optional directive which declares the ``<target-triple>-<target-id>`` supported
14787by the containing assembler source file. Used by the assembler to validate
14788command-line options such as ``-triple``, ``-mcpu``, and
14789``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
14790:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
14791
14792.. note::
14793
14794  The target ID syntax used for code object V2 to V3 for this directive differs
14795  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
14796
14797.amdhsa_kernel <name>
14798+++++++++++++++++++++
14799
14800Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
14801``<name>.kd``, in the current location of the current section. Only valid when
14802the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
14803instruction to execute, and does not need to be previously defined.
14804
14805Marks the beginning of a list of directives used to generate the bytes of a
14806kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
14807Directives which may appear in this list are described in
14808:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
14809be valid for the target being assembled for, and cannot be repeated. Directives
14810support the range of values specified by the field they reference in
14811:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
14812assumed to have its default value, unless it is marked as "Required", in which
14813case it is an error to omit the directive. This list of directives is
14814terminated by an ``.end_amdhsa_kernel`` directive.
14815
14816  .. table:: AMDHSA Kernel Assembler Directives
14817     :name: amdhsa-kernel-directives-table
14818
14819     ======================================================== =================== ============ ===================
14820     Directive                                                Default             Supported On Description
14821     ======================================================== =================== ============ ===================
14822     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX11   Controls GROUP_SEGMENT_FIXED_SIZE in
14823                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14824     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX11   Controls PRIVATE_SEGMENT_FIXED_SIZE in
14825                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14826     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX11   Controls KERNARG_SIZE in
14827                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14828     ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX11   Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14830     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
14831                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14832                                                                                  GFX940)
14833     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_PTR in
14834                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14835     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX11   Controls ENABLE_SGPR_QUEUE_PTR in
14836                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14837     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX11   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
14838                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14839     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_ID in
14840                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14841     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
14842                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14843                                                                                  GFX940)
14844     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX11   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
14845                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14846     ``.amdhsa_wavefront_size32``                             Target              GFX10-GFX11  Controls ENABLE_WAVEFRONT_SIZE32 in
14847                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14848                                                              Specific
14849                                                              (wavefrontsize64)
14850     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
14851                                                                                  (except      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14852                                                                                  GFX940)
14853     ``.amdhsa_enable_private_segment``                       0                   GFX940,      Controls ENABLE_PRIVATE_SEGMENT in
14854                                                                                  GFX11        :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14855     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_X in
14856                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14857     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
14858                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14859     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
14860                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14861     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_INFO in
14862                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14863     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX11   Controls ENABLE_VGPR_WORKITEM_ID in
14864                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14865                                                                                               Possible values are defined in
14866                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
14867     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX11   Maximum VGPR number explicitly referenced, plus one.
14868                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
14869                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14870     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX11   Maximum SGPR number explicitly referenced, plus one.
14871                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14872                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
     ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of the first AccVGPR in the unified register file.
14874                                                                                  GFX940       Used to calculate ACCUM_OFFSET in
14875                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14876     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX11   Whether the kernel may use the special VCC SGPR.
14877                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14878                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14879     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
14880                                                                                  (except      scratch memory. Used to calculate
14881                                                                                  GFX940)      GRANULATED_WAVEFRONT_SGPR_COUNT in
14882                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14883     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
14884                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14885                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14886                                                              (xnack)
14887     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_32 in
14888                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14889                                                                                               Possible values are defined in
14890                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14891     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_16_64 in
14892                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14893                                                                                               Possible values are defined in
14894                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14895     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_32 in
14896                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14897                                                                                               Possible values are defined in
14898                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14899     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_16_64 in
14900                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14901                                                                                               Possible values are defined in
14902                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14903     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX11   Controls ENABLE_DX10_CLAMP in
14904                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14905     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX11   Controls ENABLE_IEEE_MODE in
14906                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14907     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX11   Controls FP16_OVFL in
14908                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14909     ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in
14910                                                              Feature             GFX940,      :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14911                                                              Specific            GFX11
14912                                                              (tgsplit)
14913     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10-GFX11  Controls ENABLE_WGP_MODE in
14914                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14915                                                              Specific
14916                                                              (cumode)
14917     ``.amdhsa_memory_ordered``                               1                   GFX10-GFX11  Controls MEM_ORDERED in
14918                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14919     ``.amdhsa_forward_progress``                             0                   GFX10-GFX11  Controls FWD_PROGRESS in
14920                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14921     ``.amdhsa_shared_vgpr_count``                            0                   GFX10-GFX11  Controls SHARED_VGPR_COUNT in
14922                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
14923     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
14924                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14925     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
14926                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14927     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
14928                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14929     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
14930                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14931     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
14932                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14933     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
14934                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14935     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
14936                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14937     ======================================================== =================== ============ ===================
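
As a sketch of how the directives in the table combine, the following kernel
descriptor overrides a few defaults and supplies the two directives marked
"Required". The kernel name ``my_kernel`` and all values are hypothetical, and
the target is assumed to be one, such as GFX9 other than GFX90A and GFX940, on
which ``.amdhsa_accum_offset`` is not also required:

.. code::

   .amdhsa_kernel my_kernel
     .amdhsa_group_segment_fixed_size 1024     // overrides the default of 0
     .amdhsa_user_sgpr_kernarg_segment_ptr 1   // overrides the default of 0
     .amdhsa_system_sgpr_workgroup_id_y 1      // overrides the default of 0
     .amdhsa_next_free_vgpr 8                  // Required
     .amdhsa_next_free_sgpr 16                 // Required
   .end_amdhsa_kernel

Every directive that is omitted here keeps the default listed in the table.
Repeating any directive inside the list is an error, as noted above.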
14938
14939.amdgpu_metadata
14940++++++++++++++++
14941
14942Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
14943note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
14944
14945The contents must be in the [YAML]_ markup format, with the same structure and
14946semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
14947:ref:`amdgpu-amdhsa-code-object-metadata-v4` or
14948:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
14949
14950This directive is terminated by an ``.end_amdgpu_metadata`` directive.
14951
14952.. _amdgpu-amdhsa-assembler-example-v3-onwards:
14953
14954Code Object V3 and Above Example Source Code
14955~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14956
14957Here is an example of a minimal assembly source file, defining one HSA kernel:
14958
14959.. code::
14960   :number-lines:
14961
14962   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
14963
14964   .text
14965   .globl hello_world
14966   .p2align 8
14967   .type hello_world,@function
14968   hello_world:
14969     s_load_dwordx2 s[0:1], s[0:1] 0x0
14970     v_mov_b32 v0, 3.14159
14971     s_waitcnt lgkmcnt(0)
14972     v_mov_b32 v1, s0
14973     v_mov_b32 v2, s1
14974     flat_store_dword v[1:2], v0
14975     s_endpgm
14976   .Lfunc_end0:
14977     .size   hello_world, .Lfunc_end0-hello_world
14978
14979   .rodata
14980   .p2align 6
14981   .amdhsa_kernel hello_world
14982     .amdhsa_user_sgpr_kernarg_segment_ptr 1
14983     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
14984     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
14985   .end_amdhsa_kernel
14986
14987   .amdgpu_metadata
14988   ---
14989   amdhsa.version:
14990     - 1
14991     - 0
14992   amdhsa.kernels:
14993     - .name: hello_world
14994       .symbol: hello_world.kd
14995       .kernarg_segment_size: 48
14996       .group_segment_fixed_size: 0
14997       .private_segment_fixed_size: 0
14998       .kernarg_segment_align: 4
14999       .wavefront_size: 64
15000       .sgpr_count: 2
15001       .vgpr_count: 3
15002       .max_flat_workgroup_size: 256
15003       .args:
15004         - .size: 8
15005           .offset: 0
15006           .value_kind: global_buffer
15007           .address_space: global
15008           .actual_access: write_only
15009   //...
15010   .end_amdgpu_metadata
15011
15012This kernel is equivalent to the following HIP program:
15013
15014.. code::
15015   :number-lines:
15016
15017   __global__ void hello_world(float *p) {
15018       *p = 3.14159f;
15019   }
15020
15021If an assembly source file contains multiple kernels and/or functions, the
15022:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15023:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:
15028
.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   // GPR tracking symbols are implicitly set to zero

   .text
   .globl kern0
   .p2align 8
   .type kern0,@function
   kern0:
     // ...
     s_endpgm
   .Lkern0_end:
     .size   kern0, .Lkern0_end-kern0

   .rodata
   .p2align 6
   .amdhsa_kernel kern0
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   // reset symbols to begin tracking usage in func1 and kern1
   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   .text
   .hidden func1
   .global func1
   .p2align 2
   .type func1,@function
   func1:
     // ...
     s_setpc_b64 s[30:31]
   .Lfunc1_end:
     .size   func1, .Lfunc1_end-func1

   .globl kern1
   .p2align 8
   .type kern1,@function
   kern1:
     // ...
     s_getpc_b64 s[4:5]
     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+12
     s_swappc_b64 s[30:31], s[4:5]
     // ...
     s_endpgm
   .Lkern1_end:
     .size   kern1, .Lkern1_end-kern1

   .rodata
   .p2align 6
   .amdhsa_kernel kern1
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

These symbols cannot identify connected components, so they do not
automatically track register usage for each kernel. However, in some cases
careful organization of the kernels and functions in the source file, as in the
example above, keeps the additional effort needed to calculate GPR usage
accurately to a minimum.
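
The effect of the reset can be modelled with a small host-side C++ sketch. It
assumes the tracking rule for these symbols is that each one starts at zero and
is raised to the highest explicitly referenced register number plus one.
``GprTracker`` and ``kern1_next_free_vgpr`` are hypothetical names, and the
register numbers are invented for illustration; none of this is part of the
assembler.

.. code:: c++

   #include <algorithm>
   #include <initializer_list>

   // Hypothetical model of one tracking symbol, e.g. .amdgcn.next_free_vgpr.
   struct GprTracker {
     unsigned next_free = 0;  // tracking symbols are implicitly zero at the start

     // Model an instruction that explicitly references the given register
     // numbers: the symbol is raised to the highest referenced number plus one.
     void reference(std::initializer_list<unsigned> regs) {
       for (unsigned r : regs)
         next_free = std::max(next_free, r + 1);
     }

     // Model ".set <symbol>, 0".
     void reset() { next_free = 0; }
   };

   // Value that kern1's .amdhsa_next_free_vgpr would read, with or without a
   // reset between the kern0 component and the func1/kern1 component.
   // The register numbers are invented for illustration.
   inline unsigned kern1_next_free_vgpr(bool reset_between_components) {
     GprTracker vgpr;
     vgpr.reference({0, 9});  // kern0 touches registers up to v9
     if (reset_between_components)
       vgpr.reset();
     vgpr.reference({0, 1});  // func1 touches registers up to v1
     vgpr.reference({2});     // kern1 touches registers up to v2
     return vgpr.next_free;
   }

Under these assumptions, ``kern1_next_free_vgpr(false)`` yields 10, because the
usage recorded for ``kern0`` leaks into the value read for ``kern1``, while
``kern1_next_free_vgpr(true)`` yields 3, which covers only ``func1`` and
``kern1``.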

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
.. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ GitHub <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
