=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX940
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPU/AMDGPUAsmGFX1013
   AMDGPU/AMDGPUAsmGFX1030
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================
101
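As a minimal sketch of how the components in the tables above combine, the
following Python helper assembles a triple string and rejects components that
the tables do not list. The name ``make_amdgpu_triple`` is hypothetical and is
not part of LLVM or Clang. Two behaviors are assumptions of this sketch rather
than documented rules: trailing empty components are dropped, and the
``mesa3d`` vendor is accepted only together with the ``mesa3d`` OS.

.. code-block:: python

   # Components copied from the tables above ("" stands for *<empty>*).
   ARCHITECTURES = {"r600", "amdgcn"}
   VENDORS = {"amd", "mesa3d"}
   OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}
   ENVIRONMENTS = {""}


   def make_amdgpu_triple(arch, vendor="amd", os_name="", env=""):
       """Return '<Architecture>-<Vendor>-<OS>-<Environment>' (hypothetical helper)."""
       checks = [
           ("architecture", arch, ARCHITECTURES),
           ("vendor", vendor, VENDORS),
           ("OS", os_name, OPERATING_SYSTEMS),
           ("environment", env, ENVIRONMENTS),
       ]
       for kind, value, allowed in checks:
           if value not in allowed:
               raise ValueError(f"unsupported {kind}: {value!r}")
       # Assumption: read "can be used if the OS is mesa3d" as a strict rule.
       if vendor == "mesa3d" and os_name != "mesa3d":
           raise ValueError("vendor 'mesa3d' expects OS 'mesa3d'")
       parts = [arch, vendor, os_name, env]
       # Assumption: omit trailing empty components instead of emitting "--".
       while parts[-1] == "":
           parts.pop()
       return "-".join(parts)


   print(make_amdgpu_triple("amdgcn", "amd", "amdhsa"))  # amdgcn-amd-amdhsa
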
.. _amdgpu-processors:

Processors
----------

Use the Clang option ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target-specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exceptions:

* ``amdhsa`` is not supported on the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).
115
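The option spelling and the exception above can be illustrated with a short
Python sketch. The helper ``amdgpu_clang_args`` is hypothetical; it only
formats the ``-target`` and ``-mcpu`` options described in this guide, assumes
the ``amd`` vendor, and does not check the processor name against the
processor table.

.. code-block:: python

   def amdgpu_clang_args(arch, os_name, target_id):
       """Return Clang options for an AMDGPU target (hypothetical helper)."""
       # Exception listed above: amdhsa is not supported on r600.
       if arch == "r600" and os_name == "amdhsa":
           raise ValueError("amdhsa is not supported on the r600 architecture")
       # Assumption: always use the "amd" vendor, which the vendor table
       # describes as usable for all AMD GPU usage.
       triple = f"{arch}-amd-{os_name}"
       return ["-target", triple, f"-mcpu={target_id}"]


   # Example: options for a gfx906 processor on an HSA-compatible runtime.
   print(amdgpu_clang_args("amdgcn", "amdhsa", "gfx906"))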
116
  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
420                                                                                                        names.
421
422     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
423                                                    - wavefrontsize64   flat
424                                                                        scratch                       .. TODO::
425
426                                                                                                        Add product
427                                                                                                        names.
428     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
429                                                    - wavefrontsize64   flat
430                                                                        scratch                       .. TODO::
431
432                                                                                                        Add product
433                                                                                                        names.
434
435     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
436                                                    - wavefrontsize64   flat
437                                                                        scratch                       .. TODO::
438                                                                                                        Add product
439                                                                                                        names.
440
441     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
442                                                    - wavefrontsize64   flat
443                                                                        scratch                       .. TODO::
444
445                                                                                                        Add product
446                                                                                                        names.
447
448     **GCN GFX11**
449     -----------------------------------------------------------------------------------------------------------------------
450     ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected   - *pal-amdpal*  *TBA*
451                                                    - wavefrontsize64   flat
452                                                                        scratch                       .. TODO::
453                                                                      - Packed
454                                                                        work-item                       Add product
455                                                                        IDs                             names.
456
457     ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
458                                                    - wavefrontsize64   flat
459                                                                        scratch                       .. TODO::
460                                                                      - Packed
461                                                                        work-item                       Add product
462                                                                        IDs                             names.
463
464     ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
465                                                    - wavefrontsize64   flat
466                                                                        scratch                       .. TODO::
467                                                                      - Packed
468                                                                        work-item                       Add product
469                                                                        IDs                             names.
470
471     ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
472                                                    - wavefrontsize64   flat
473                                                                        scratch                       .. TODO::
474                                                                      - Packed
475                                                                        work-item                       Add product
476                                                                        IDs                             names.
477
478     =========== =============== ============ ===== ================= =============== =============== ======================
479
480.. _amdgpu-target-features:
481
482Target Features
483---------------
484
Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
487all processors. The runtime must ensure that the features supported by
488the device used to execute the code match the features enabled when
489generating the code. A mismatch of features may result in incorrect
490execution, or a reduction in performance.
491
The target features supported by each processor are listed in
493:ref:`amdgpu-processor-table`.
494
495Target features are controlled by exactly one of the following Clang
496options:
497
498``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
499
  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.
503
504``-m[no-]<target-feature>``
505
506  Target features not specified by the target ID are specified using a
507  separate option. These target features can have an ``on`` or ``off``
508  value.  ``on`` is specified by omitting the ``no-`` prefix, and
509  ``off`` is specified by including the ``no-`` prefix. The default
510  if not specified is ``off``.
511
512For example:
513
514``-mcpu=gfx908:xnack+``
515  Enable the ``xnack`` feature.
516``-mcpu=gfx908:xnack-``
517  Disable the ``xnack`` feature.
518``-mcumode``
519  Enable the ``cumode`` feature.
520``-mno-cumode``
521  Disable the ``cumode`` feature.
522
523  .. table:: AMDGPU Target Features
524     :name: amdgpu-target-features-table
525
526     =============== ============================ ==================================================
527     Target Feature  Clang Option to Control      Description
528     Name
529     =============== ============================ ==================================================
530     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled,
                                                  native WGP wavefront execution mode is used;
                                                  when enabled, CU wavefront execution mode is used
534                                                  (see :ref:`amdgpu-amdhsa-memory-model`).
535
536     sramecc         - ``-mcpu``                  If specified, generate code that can only be
537                     - ``--offload-arch``         loaded and executed in a process that has a
538                                                  matching setting for SRAMECC.
539
540                                                  If not specified for code object V2 to V3, generate
541                                                  code that can be loaded and executed in a process
542                                                  with SRAMECC enabled.
543
544                                                  If not specified for code object V4 or above, generate
545                                                  code that can be loaded and executed in a process
546                                                  with either setting of SRAMECC.
547
     tgsplit         - ``-m[no-]tgsplit``         Enable/disable generating code that assumes
549                                                  work-groups are launched in threadgroup split mode.
550                                                  When enabled the waves of a work-group may be
551                                                  launched in different CUs.
552
553     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled,
                                                  native wavefront size 32 is used; when enabled,
                                                  wavefront size 64 is used.
557
558     xnack           - ``-mcpu``                  If specified, generate code that can only be
559                     - ``--offload-arch``         loaded and executed in a process that has a
560                                                  matching setting for XNACK replay.
561
562                                                  If not specified for code object V2 to V3, generate
563                                                  code that can be loaded and executed in a process
564                                                  with XNACK replay enabled.
565
566                                                  If not specified for code object V4 or above, generate
567                                                  code that can be loaded and executed in a process
568                                                  with either setting of XNACK replay.
569
570                                                  XNACK replay can be used for demand paging and
571                                                  page migration. If enabled in the device, then if
572                                                  a page fault occurs the code may execute
573                                                  incorrectly unless generated with XNACK replay
574                                                  enabled, or generated for code object V4 or above without
575                                                  specifying XNACK replay. Executing code that was
576                                                  generated with XNACK replay enabled, or generated
577                                                  for code object V4 or above without specifying XNACK replay,
578                                                  on a device that does not have XNACK replay
579                                                  enabled will execute correctly but may be less
580                                                  performant than code generated for XNACK replay
581                                                  disabled.
582     =============== ============================ ==================================================
583
584.. _amdgpu-target-id:
585
586Target ID
587---------
588
589AMDGPU supports target IDs. See `Clang Offload Bundler
590<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
591description. The AMDGPU target specific information is:
592
593**processor**
594  Is an AMDGPU processor or alternative processor name specified in
595  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.
598
599**target-feature**
600  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
603  a target ID are marked as being controlled by ``-mcpu`` and
604  ``--offload-arch``. Each target feature must appear at most once in a target
605  ID. The non-canonical form target ID allows the target features to be
606  specified in any order. The canonical form target ID requires the target
607  features to be specified in alphabetic order.
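
As an illustration of the two forms, the following is a minimal Python sketch,
not the implementation used by Clang. It assumes the
``<processor>(:<target-feature>(+|-))*`` syntax shown in the ``-mcpu`` examples
above. The helper names ``parse_target_id`` and ``canonical_target_id`` are
hypothetical. The sketch does not map alternative processor names to primary
processor names, which a complete canonicalization would also need to do.

.. code:: python

  def parse_target_id(target_id):
      """Split a target ID such as "gfx908:xnack+" into its parts.

      Returns (processor, features), where features maps each feature name
      to True ("+", on) or False ("-", off). A feature that is absent from
      the mapping has the "any" value.
      """
      processor, *components = target_id.split(":")
      if not processor:
          raise ValueError("missing processor name")
      features = {}
      for component in components:
          name, sign = component[:-1], component[-1:]
          if not name or sign not in ("+", "-"):
              raise ValueError(f"malformed target feature: {component!r}")
          # Each target feature may appear at most once in a target ID.
          if name in features:
              raise ValueError(f"target feature given more than once: {name!r}")
          features[name] = sign == "+"
      return processor, features


  def canonical_target_id(target_id):
      """Return the target ID with its features in alphabetical order."""
      processor, features = parse_target_id(target_id)
      parts = [processor]
      for name in sorted(features):
          parts.append(name + ("+" if features[name] else "-"))
      return ":".join(parts)

For example, ``canonical_target_id("gfx908:xnack-:sramecc+")`` returns
``"gfx908:sramecc+:xnack-"``.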
608
609.. _amdgpu-target-id-v2-v3:
610
611Code Object V2 to V3 Target ID
612~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
613
614The target ID syntax for code object V2 to V3 is the same as defined in `Clang
615Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
616when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
617directive and the bundle entry ID. In those cases it has the following BNF
618syntax:
619
620.. code::
621
  <target-id> ::= <processor> ( "+" <target-feature> )*
623
624Where a target feature is omitted if *Off* and present if *On* or *Any*.
625
626.. note::
627
  Code object V2 to V3 cannot represent *Any* and treats it the same as
629  *On*.
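
The rule above can be sketched in Python as follows. This is an illustrative
helper, not part of LLVM. The name ``v2_v3_target_id`` and the use of ``None``
to represent *Any* are assumptions, as is emitting the features in alphabetical
order, which the syntax above does not require.

.. code:: python

  def v2_v3_target_id(processor, features):
      """Format a code object V2 to V3 target ID.

      `features` maps a feature name to True (On), False (Off), or
      None (Any). Features that are Off are omitted; On and Any are both
      emitted, because this form cannot distinguish them.
      """
      present = sorted(
          name for name, value in features.items() if value is not False
      )
      return "".join([processor] + ["+" + name for name in present])

For example, both ``v2_v3_target_id("gfx908", {"xnack": True})`` and
``v2_v3_target_id("gfx908", {"xnack": None})`` return ``"gfx908+xnack"``,
while ``{"xnack": False}`` yields ``"gfx908"``.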
630
631.. _amdgpu-embedding-bundled-objects:
632
633Embedding Bundled Code Objects
634------------------------------
635
636AMDGPU supports the HIP and OpenMP languages that perform code object embedding
637as described in `Clang Offload Bundler
638<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
639
640.. note::
641
642  The target ID syntax used for code object V2 to V3 for a bundle entry ID
643  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
644
645.. _amdgpu-address-spaces:
646
647Address Spaces
648--------------
649
650The AMDGPU architecture supports a number of memory address spaces. The address
651space names use the OpenCL standard names, with some additions.
652
653The AMDGPU address spaces correspond to target architecture specific LLVM
654address space numbers used in LLVM IR.
655
656The AMDGPU address spaces are described in
657:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
658supported for the ``amdgcn`` target.
659
660  .. table:: AMDGPU Address Spaces
661     :name: amdgpu-address-spaces-table
662
663     ================================= =============== =========== ================ ======= ============================
664     ..                                                                                     64-Bit Process Address Space
665     --------------------------------- --------------- ----------- ---------------- ------------------------------------
666     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
667                                       Space Number    Name        Name             Size
668     ================================= =============== =========== ================ ======= ============================
669     Generic                           0               flat        flat             64      0x0000000000000000
670     Global                            1               global      global           64      0x0000000000000000
671     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
672     Local                             3               group       LDS              32      0xFFFFFFFF
673     Constant                          4               constant    *same as global* 64      0x0000000000000000
674     Private                           5               private     scratch          32      0xFFFFFFFF
675     Constant 32-bit                   6               *TODO*                               0x00000000
676     Buffer Fat Pointer (experimental) 7               *TODO*
677     ================================= =============== =========== ================ ======= ============================
678
679**Generic**
680  The generic address space is supported unless the *Target Properties* column
681  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
682  space*.
683
684  The generic address space uses the hardware flat address support for two fixed
685  ranges of virtual addresses (the private and local apertures), that are
686  outside the range of addressable global memory, to map from a flat address to
687  a private or local address. This uses FLAT instructions that can take a flat
688  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.
690
691  Flat access to scratch requires hardware aperture setup and setup in the
692  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
693  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
694  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
695
  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For GFX9-GFX11 the
  aperture base addresses are directly available as inline constant registers
  ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit address
  mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32, which
  makes it easier to convert from flat to segment or segment to flat.
706
707  A global address space address has the same value when used as a flat address
708  so no conversion is needed.
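
  As a rough sketch of why the 2^32 alignment helps, the conversions reduce to
  bit operations on the upper and lower 32 bits of the flat address. The Python
  below is illustrative only. The function names and the aperture base used in
  the example are hypothetical, and the sketch ignores the NULL values, which
  differ between address spaces (see :ref:`amdgpu-address-spaces-table`) and
  need separate handling.

  .. code:: python

    APERTURE_SIZE = 1 << 32           # each aperture spans 2^32 bytes
    SEGMENT_MASK = APERTURE_SIZE - 1  # low 32 bits hold the segment address


    def segment_to_flat(aperture_base, segment_address):
        """Convert a 32-bit private or local address to a flat address."""
        assert aperture_base % APERTURE_SIZE == 0, "base must be 2^32 aligned"
        assert 0 <= segment_address < APERTURE_SIZE
        # The base has zero low bits, so OR is equivalent to addition.
        return aperture_base | segment_address


    def in_aperture(aperture_base, flat_address):
        """Return True if the flat address lies within the given aperture."""
        return (flat_address & ~SEGMENT_MASK) == aperture_base


    def flat_to_segment(flat_address):
        """Recover the segment address from a flat address in an aperture."""
        return flat_address & SEGMENT_MASK

  With a hypothetical local aperture base of ``0x100000000000``,
  ``segment_to_flat(0x100000000000, 0x40)`` gives ``0x100000000040``, and
  ``flat_to_segment`` applied to that result recovers ``0x40``.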
709
710**Global and Constant**
711  The global and constant address spaces both use global virtual addresses,
712  which are the same virtual address space used by the CPU. However, some
713  virtual addresses may only be accessible to the CPU, some only accessible
714  by the GPU, and some by both.
715
716  Using the constant address space indicates that the data will not change
717  during the execution of the kernel. This allows scalar read instructions to
  be used. Because the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
722  of volatile data before each kernel dispatch execution to allow constant
723  memory to change values between kernel dispatches.
724
725**Region**
726  The region address space uses the hardware Global Data Store (GDS). All
727  wavefronts executing on the same device will access the same memory for any
728  given region address. However, the same region address accessed by wavefronts
729  executing on different devices will access different memory. It is higher
730  performance than global memory. It is allocated by the runtime. The data
731  store (DS) instructions can be used to access it.
732
733**Local**
734  The local address space uses the hardware Local Data Store (LDS) which is
735  automatically allocated when the hardware creates the wavefronts of a
736  work-group, and freed when all the wavefronts of a work-group have
737  terminated. All wavefronts belonging to the same work-group will access the
738  same memory for any given local address. However, the same local address
739  accessed by wavefronts belonging to different work-groups will access
740  different memory. It is higher performance than global memory. The data store
741  (DS) instructions can be used to access it.
742
743**Private**
744  The private address space uses the hardware scratch memory support which
745  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
747  given private address will be different to the memory accessed by another lane
748  of the same or different wavefront for the same private address.
749
750  If a kernel dispatch uses scratch, then the hardware allocates memory from a
751  pool of backing memory allocated by the runtime for each wavefront. The lanes
752  of the wavefront access this using dword (4 byte) interleaving. The mapping
753  used from private address to backing memory address is:
754
755    ``wavefront-scratch-base +
756    ((private-address / 4) * wavefront-size * 4) +
757    (wavefront-lane-id * 4) + (private-address % 4)``
758
759  If each lane of a wavefront accesses the same private address, the
760  interleaving results in adjacent dwords being accessed and hence requires
761  fewer cache lines to be fetched.
762
763  There are different ways that the wavefront scratch base address is
764  determined by a wavefront (see
765  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
766
767  Scratch memory can be accessed in an interleaved manner using buffer
768  instructions with the scratch buffer descriptor and per wavefront scratch
769  offset, by the scratch instructions, or by flat instructions. Multi-dword
770  access is not supported except by flat and scratch instructions in
771  GFX9-GFX11.
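
  The private address mapping given above can be written as a small Python
  function. This is a sketch for illustration. The function name is
  hypothetical, and the parameter names simply follow the terms used in the
  formula.

  .. code:: python

    DWORD_BYTES = 4


    def private_to_backing(wavefront_scratch_base, private_address,
                           wavefront_lane_id, wavefront_size):
        """Map a lane's private address to its backing memory address."""
        # Split the private address into a dword index and a byte offset.
        dword_index, byte_in_dword = divmod(private_address, DWORD_BYTES)
        return (wavefront_scratch_base
                # Each dword index owns one dword per lane of the wavefront.
                + dword_index * wavefront_size * DWORD_BYTES
                # Within that row, lanes are laid out in adjacent dwords.
                + wavefront_lane_id * DWORD_BYTES
                + byte_in_dword)

  With a wavefront size of 64 and a scratch base of 0, lanes 0 and 1 accessing
  private address 0 map to backing offsets 0 and 4, while lane 0 accessing
  private address 4 maps to offset 256.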
772
773**Constant 32-bit**
774  *TODO*
775
776**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is
  intended, in the future, to support the modelling of 128-bit buffer
  descriptors plus a 32-bit offset into the buffer (in total encapsulating a
  160-bit *pointer*), allowing normal LLVM load/store/atomic operations to be
  used to model the buffer descriptors used heavily in graphics workloads
  targeting the backend.
784
785.. _amdgpu-memory-scopes:
786
787Memory Scopes
788-------------
789
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
793
794The memory model supported is based on the HSA memory model [HSA]_ which is
795based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
796relation is transitive over the synchronizes-with relation independent of scope
797and synchronizes-with allows the memory scope instances to be inclusive (see
798table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
799
800This is different to the OpenCL [OpenCL]_ memory model which does not have scope
801inclusion and requires the memory scopes to exactly match. However, this
802is conservatively correct for OpenCL.
803
804  .. table:: AMDHSA LLVM Sync Scopes
805     :name: amdgpu-amdhsa-llvm-sync-scopes-table
806
807     ======================= ===================================================
808     LLVM Sync Scope         Description
809     ======================= ===================================================
810     *none*                  The default: ``system``.
811
812                             Synchronizes with, and participates in modification
813                             and seq_cst total orderings with, other operations
814                             (except image operations) for all address spaces
815                             (except private, or generic that accesses private)
816                             provided the other operation's sync scope is:
817
818                             - ``system``.
819                             - ``agent`` and executed by a thread on the same
820                               agent.
821                             - ``workgroup`` and executed by a thread in the
822                               same work-group.
823                             - ``wavefront`` and executed by a thread in the
824                               same wavefront.
825
826     ``agent``               Synchronizes with, and participates in modification
827                             and seq_cst total orderings with, other operations
828                             (except image operations) for all address spaces
829                             (except private, or generic that accesses private)
830                             provided the other operation's sync scope is:
831
832                             - ``system`` or ``agent`` and executed by a thread
833                               on the same agent.
834                             - ``workgroup`` and executed by a thread in the
835                               same work-group.
836                             - ``wavefront`` and executed by a thread in the
837                               same wavefront.
838
839     ``workgroup``           Synchronizes with, and participates in modification
840                             and seq_cst total orderings with, other operations
841                             (except image operations) for all address spaces
842                             (except private, or generic that accesses private)
843                             provided the other operation's sync scope is:
844
845                             - ``system``, ``agent`` or ``workgroup`` and
846                               executed by a thread in the same work-group.
847                             - ``wavefront`` and executed by a thread in the
848                               same wavefront.
849
850     ``wavefront``           Synchronizes with, and participates in modification
851                             and seq_cst total orderings with, other operations
852                             (except image operations) for all address spaces
853                             (except private, or generic that accesses private)
854                             provided the other operation's sync scope is:
855
856                             - ``system``, ``agent``, ``workgroup`` or
857                               ``wavefront`` and executed by a thread in the
858                               same wavefront.
859
860     ``singlethread``        Only synchronizes with and participates in
861                             modification and seq_cst total orderings with,
862                             other operations (except image operations) running
863                             in the same thread for all address spaces (for
864                             example, in signal handlers).
865
866     ``one-as``              Same as ``system`` but only synchronizes with other
867                             operations within the same address space.
868
869     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
870                             operations within the same address space.
871
872     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
873                             other operations within the same address space.
874
875     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
876                             other operations within the same address space.
877
878     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
879                             other operations within the same address space.
880     ======================= ===================================================
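
One way to read the scope inclusion rule in the table is that two operations
can synchronize only if both executing threads belong to the same instance of
the narrower of the two sync scopes. The Python sketch below encodes that
reading. It is an illustration, not backend code: the names ``SCOPE_RANK`` and
``scopes_synchronize`` are hypothetical, the default scope is passed explicitly
as ``"system"``, and the ``one-as`` variants and the image and private address
space exclusions are not modeled.

.. code:: python

  # Sync scopes ordered from narrowest to widest.
  SCOPE_RANK = {
      "singlethread": 0,
      "wavefront": 1,
      "workgroup": 2,
      "agent": 3,
      "system": 4,
  }


  def scopes_synchronize(scope_a, scope_b, narrowest_shared):
      """Return True if operations with the two sync scopes can synchronize.

      `narrowest_shared` names the narrowest scope instance that contains
      both executing threads, for example "workgroup" for two threads in
      different wavefronts of the same work-group, or "system" for threads
      on different agents.
      """
      narrower = min(SCOPE_RANK[scope_a], SCOPE_RANK[scope_b])
      return narrower >= SCOPE_RANK[narrowest_shared]

For example, a ``system`` operation and an ``agent`` operation synchronize when
their threads run on the same agent, so
``scopes_synchronize("system", "agent", "agent")`` is ``True``, whereas
``scopes_synchronize("agent", "workgroup", "agent")`` is ``False`` because the
threads are not in the same work-group.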
881
882LLVM IR Intrinsics
883------------------
884
885The AMDGPU backend implements the following LLVM IR intrinsics.
886
887*This section is WIP.*
888
889.. TODO::
890
891   List AMDGPU intrinsics.
892
893LLVM IR Attributes
894------------------
895
896The AMDGPU backend supports the following LLVM IR attributes.
897
898  .. table:: AMDGPU LLVM IR Attributes
899     :name: amdgpu-llvm-ir-attributes-table
900
901     ======================================= ==========================================================
902     LLVM Attribute                          Description
903     ======================================= ==========================================================
904     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
905                                             will be specified when the kernel is dispatched. Generated
906                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
907                                             The implied default value is 1,1024.
908
909     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
910                                             argument block size for the implicit arguments. This
911                                             varies by OS and language (for OpenCL see
912                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
913     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
914                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
915     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
916                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
917     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
918                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
919                                             CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
920                                             and the backend may not be able to satisfy the request. If
921                                             the specified range is incompatible with the function's
                                             "amdgpu-flat-work-group-size" value, the occupancy bounds
                                             implied by the workgroup size take precedence.
924
925     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
926                                             mode register to be set on entry. Overrides the default for
927                                             the calling convention.
928     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
929                                             the mode register to be set on entry. Overrides the default
930                                             for the calling convention.
931
932     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
933                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
934                                             attribute, or reached through a call site marked with this attribute,
935                                             the value returned by the intrinsic is undefined. The backend can
936                                             generally infer this during code generation, so typically there is no
937                                             benefit to frontends marking functions with this.
938
939     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
940                                             llvm.amdgcn.workitem.id.y intrinsic.
941
942     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
943                                             llvm.amdgcn.workitem.id.z intrinsic.
944
945     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
946                                             llvm.amdgcn.workgroup.id.x intrinsic.
947
948     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
949                                             llvm.amdgcn.workgroup.id.y intrinsic.
950
951     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
952                                             llvm.amdgcn.workgroup.id.z intrinsic.
953
954     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
955                                             llvm.amdgcn.dispatch.ptr intrinsic.
956
957     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
958                                             llvm.amdgcn.implicitarg.ptr intrinsic.
959
960     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
961                                             llvm.amdgcn.dispatch.id intrinsic.
962
963     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
964                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
965                                             attributes, the queue pointer may be required in situations where the
966                                             intrinsic call does not directly appear in the program. Some subtargets
                                             require the queue pointer to handle some addrspacecasts, as well
968                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
969                                             llvm.debug intrinsics.
970
971     "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
972                                             kernel argument that holds the pointer to the hostcall buffer. If this
973                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
974
975     "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
976                                             kernel argument that holds the pointer to an initialized memory buffer
977                                             that conforms to the requirements of the malloc/free device library V1
978                                             version implementation. If this attribute is absent, then the
979                                             amdgpu-no-implicitarg-ptr is also removed.
980
981     "amdgpu-no-multigrid-sync-arg"          Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
982                                             kernel argument that holds the multigrid synchronization pointer. If this
983                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
984     ======================================= ==========================================================
985
986.. _amdgpu-elf-code-object:
987
988ELF Code Object
989===============
990
991The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
992can be linked by ``lld`` to produce a standard ELF shared code object which can
993be loaded and executed on an AMDGPU target.
994
995.. _amdgpu-elf-header:
996
997Header
998------
999
1000The AMDGPU backend uses the following ELF header:
1001
1002  .. table:: AMDGPU ELF Header
1003     :name: amdgpu-elf-header-table
1004
1005     ========================== ===============================
1006     Field                      Value
1007     ========================== ===============================
1008     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
1009     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
1010     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
1011                                - ``ELFOSABI_AMDGPU_HSA``
1012                                - ``ELFOSABI_AMDGPU_PAL``
1013                                - ``ELFOSABI_AMDGPU_MESA3D``
1014     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1015                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
1016                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
1017                                - ``ELFABIVERSION_AMDGPU_HSA_V5``
1018                                - ``ELFABIVERSION_AMDGPU_PAL``
1019                                - ``ELFABIVERSION_AMDGPU_MESA3D``
1020     ``e_type``                 - ``ET_REL``
1021                                - ``ET_DYN``
1022     ``e_machine``              ``EM_AMDGPU``
1023     ``e_entry``                0
1024     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1025                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
1026                                and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1027     ========================== ===============================
1028
1029..
1030
1031  .. table:: AMDGPU ELF Header Enumeration Values
1032     :name: amdgpu-elf-header-enumeration-values-table
1033
1034     =============================== =====
1035     Name                            Value
1036     =============================== =====
1037     ``EM_AMDGPU``                   224
1038     ``ELFOSABI_NONE``               0
1039     ``ELFOSABI_AMDGPU_HSA``         64
1040     ``ELFOSABI_AMDGPU_PAL``         65
1041     ``ELFOSABI_AMDGPU_MESA3D``      66
1042     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1043     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1044     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1045     ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1046     ``ELFABIVERSION_AMDGPU_PAL``    0
1047     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1048     =============================== =====
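
As an illustration of the header values listed above, the following minimal
sketch checks whether a byte buffer begins with an ``amdgcn`` ELF header and,
for the ``amdhsa`` OS, maps ``e_ident[EI_ABIVERSION]`` to a code object version.
The helper names ``is_amdgcn_elf_header`` and ``hsa_code_object_version`` are
hypothetical and are not part of LLVM. The ``e_ident`` indices and the
``e_machine`` offset are assumed from the generic ELF layout [ELF]_.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Values taken from the enumeration table above. */
  enum { EM_AMDGPU = 224, ELFOSABI_AMDGPU_HSA = 64 };

  /* Generic ELF layout constants (assumed from the ELF specification). */
  enum {
    EI_CLASS = 4, EI_DATA = 5, EI_OSABI = 7, EI_ABIVERSION = 8,
    ELFCLASS64_VALUE = 2, ELFDATA2LSB_VALUE = 1,
    E_MACHINE_OFFSET = 18, MIN_HEADER_BYTES = 20
  };

  /* Hypothetical helper: returns 1 if buf starts with a 64-bit,
     little-endian ELF header whose e_machine is EM_AMDGPU, else 0. */
  int is_amdgcn_elf_header(const uint8_t *buf, size_t size) {
    if (buf == NULL || size < MIN_HEADER_BYTES)
      return 0;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
      return 0;
    if (buf[EI_CLASS] != ELFCLASS64_VALUE || buf[EI_DATA] != ELFDATA2LSB_VALUE)
      return 0;
    /* e_machine is a 16-bit little-endian field. */
    uint16_t machine = (uint16_t)(buf[E_MACHINE_OFFSET] |
                                  (buf[E_MACHINE_OFFSET + 1] << 8));
    return machine == EM_AMDGPU;
  }

  /* Hypothetical helper: returns the HSA code object version (2 to 5),
     or 0 if the header is not an amdhsa code object with a known ABI. */
  int hsa_code_object_version(const uint8_t *buf, size_t size) {
    if (!is_amdgcn_elf_header(buf, size))
      return 0;
    if (buf[EI_OSABI] != ELFOSABI_AMDGPU_HSA)
      return 0;
    /* ELFABIVERSION_AMDGPU_HSA_V2..V5 have the values 0..3. */
    uint8_t abi = buf[EI_ABIVERSION];
    return abi <= 3 ? abi + 2 : 0;
  }

The sketch deliberately ignores ``r600`` (``ELFCLASS32``) objects and does not
distinguish the PAL and Mesa 3D ABI versions, which share the value 0.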
1049
1050``e_ident[EI_CLASS]``
1051  The ELF class is:
1052
1053  * ``ELFCLASS32`` for ``r600`` architecture.
1054
1055  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1056    process address space applications.
1057
1058``e_ident[EI_DATA]``
1059  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1060
1061``e_ident[EI_OSABI]``
1062  One of the following AMDGPU target architecture specific OS ABIs
1063  (see :ref:`amdgpu-os`):
1064
1065  * ``ELFOSABI_NONE`` for *unknown* OS.
1066
1067  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1068
1069  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1070
  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1072
1073``e_ident[EI_ABIVERSION]``
1074  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1075  object conforms:
1076
1077  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1078    runtime ABI for code object V2. Specify using the Clang option
1079    ``-mcode-object-version=2``.
1080
1081  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1082    runtime ABI for code object V3. Specify using the Clang option
1083    ``-mcode-object-version=3``.
1084
1085  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1086    runtime ABI for code object V4. Specify using the Clang option
1087    ``-mcode-object-version=4``. This is the default code object
1088    version if not specified.
1089
1090  * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1091    runtime ABI for code object V5. Specify using the Clang option
1092    ``-mcode-object-version=5``.
1093
1094  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1095    runtime ABI.
1096
1097  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1098    3D runtime ABI.
1099
1100``e_type``
1101  Can be one of the following values:
1102
1103
1104  ``ET_REL``
    The type produced by the AMDGPU backend compiler, as it is a relocatable
    code object.
1107
1108  ``ET_DYN``
1109    The type produced by the linker as it is a shared code object.
1110
  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1112
1113``e_machine``
1114  The value ``EM_AMDGPU`` is used for the machine for all processors supported
1115  by the ``r600`` and ``amdgcn`` architectures (see
1116  :ref:`amdgpu-processor-table`). The specific processor is specified in the
1117  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1118  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1119  ``e_flags`` for code object V3 and above (see
1120  :ref:`amdgpu-elf-header-e_flags-table-v3` and
1121  :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1122
1123``e_entry``
1124  The entry point is 0 as the entry points for individual kernels must be
1125  selected in order to invoke them through AQL packets.
1126
1127``e_flags``
1128  The AMDGPU backend uses the following ELF header flags:
1129
1130  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1131     :name: amdgpu-elf-header-e_flags-v2-table
1132
1133     ===================================== ===== =============================
1134     Name                                  Value Description
1135     ===================================== ===== =============================
1136     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1137                                                 target feature is
1138                                                 enabled for all code
1139                                                 contained in the code object.
1140                                                 If the processor
1141                                                 does not support the
1142                                                 ``xnack`` target
1143                                                 feature then must
1144                                                 be 0.
1145                                                 See
1146                                                 :ref:`amdgpu-target-features`.
1147     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1148                                                 handler is enabled for all
1149                                                 code contained in the code
1150                                                 object. If the processor
1151                                                 does not support a trap
1152                                                 handler then must be 0.
1153                                                 See
1154                                                 :ref:`amdgpu-target-features`.
1155     ===================================== ===== =============================
1156
1157  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1158     :name: amdgpu-elf-header-e_flags-table-v3
1159
1160     ================================= ===== =============================
1161     Name                              Value Description
1162     ================================= ===== =============================
1163     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1164                                             mask for
1165                                             ``EF_AMDGPU_MACH_xxx`` values
1166                                             defined in
1167                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
1168     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1169                                             target feature is
1170                                             enabled for all code
1171                                             contained in the code object.
1172                                             If the processor
1173                                             does not support the
1174                                             ``xnack`` target
1175                                             feature then must
1176                                             be 0.
1177                                             See
1178                                             :ref:`amdgpu-target-features`.
1179     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1180                                             target feature is
1181                                             enabled for all code
1182                                             contained in the code object.
1183                                             If the processor
1184                                             does not support the
1185                                             ``sramecc`` target
1186                                             feature then must
1187                                             be 0.
1188                                             See
1189                                             :ref:`amdgpu-target-features`.
1190     ================================= ===== =============================
1191
1192  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1193     :name: amdgpu-elf-header-e_flags-table-v4-onwards
1194
1195     ============================================ ===== ===================================
     Name                                         Value Description
1197     ============================================ ===== ===================================
1198     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1199                                                        mask for
1200                                                        ``EF_AMDGPU_MACH_xxx`` values
1201                                                        defined in
1202                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
1203     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1204                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1205                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1207     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1208     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1209     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1210     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1211                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1212                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1214     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1216     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1217     ============================================ ===== ===================================
1218
1219  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1220     :name: amdgpu-ef-amdgpu-mach-table
1221
1222     ==================================== ========== =============================
1223     Name                                 Value      Description (see
1224                                                     :ref:`amdgpu-processor-table`)
1225     ==================================== ========== =============================
1226     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1227     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1228     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1229     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1230     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1231     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1232     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1233     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1234     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1235     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1236     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1237     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1238     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1239     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1240     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1241     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1242     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1243     *reserved*                           0x011 -    Reserved for ``r600``
1244                                          0x01f      architecture processors.
1245     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1246     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1247     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1248     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1249     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1250     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1251     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1252     *reserved*                           0x027      Reserved.
1253     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1254     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1255     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1256     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1257     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1258     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1259     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1260     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1261     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1262     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1263     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1264     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1265     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1266     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1267     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1268     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1269     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1270     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1271     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1272     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1273     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1274     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1275     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1276     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1277     ``EF_AMDGPU_MACH_AMDGCN_GFX940``     0x040      ``gfx940``
1278     ``EF_AMDGPU_MACH_AMDGCN_GFX1100``    0x041      ``gfx1100``
1279     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1280     *reserved*                           0x043      Reserved.
1281     ``EF_AMDGPU_MACH_AMDGCN_GFX1103``    0x044      ``gfx1103``
1282     ``EF_AMDGPU_MACH_AMDGCN_GFX1036``    0x045      ``gfx1036``
1283     ``EF_AMDGPU_MACH_AMDGCN_GFX1101``    0x046      ``gfx1101``
1284     ``EF_AMDGPU_MACH_AMDGCN_GFX1102``    0x047      ``gfx1102``
1285     ==================================== ========== =============================
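
The code object V4 and above ``e_flags`` layout described in the tables above
can be decoded with simple masking. The following is a minimal sketch, not
LLVM's implementation: the function names ``amdgpu_mach``,
``amdgpu_xnack_v4``, and ``amdgpu_sramecc_v4`` are hypothetical, and only the
constants are taken from the tables.

.. code:: c

  #include <stdint.h>

  /* Masks and values from the e_flags tables for code object V4 and above. */
  enum {
    EF_AMDGPU_MACH                   = 0x0ff,
    EF_AMDGPU_FEATURE_XNACK_V4       = 0x300,
    EF_AMDGPU_FEATURE_XNACK_ANY_V4   = 0x100,
    EF_AMDGPU_FEATURE_XNACK_OFF_V4   = 0x200,
    EF_AMDGPU_FEATURE_XNACK_ON_V4    = 0x300,
    EF_AMDGPU_FEATURE_SRAMECC_V4     = 0xc00,
    EF_AMDGPU_FEATURE_SRAMECC_ANY_V4 = 0x400,
    EF_AMDGPU_FEATURE_SRAMECC_OFF_V4 = 0x800,
    EF_AMDGPU_FEATURE_SRAMECC_ON_V4  = 0xc00
  };

  /* Hypothetical helper: extracts the EF_AMDGPU_MACH_xxx value. */
  uint32_t amdgpu_mach(uint32_t e_flags) {
    return e_flags & EF_AMDGPU_MACH;
  }

  /* Hypothetical helper: names the XNACK setting encoded in e_flags. */
  const char *amdgpu_xnack_v4(uint32_t e_flags) {
    switch (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) {
    case EF_AMDGPU_FEATURE_XNACK_ANY_V4: return "any";
    case EF_AMDGPU_FEATURE_XNACK_OFF_V4: return "off";
    case EF_AMDGPU_FEATURE_XNACK_ON_V4:  return "on";
    default:                             return "unsupported"; /* 0x000 */
    }
  }

  /* Hypothetical helper: names the SRAMECC setting encoded in e_flags. */
  const char *amdgpu_sramecc_v4(uint32_t e_flags) {
    switch (e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) {
    case EF_AMDGPU_FEATURE_SRAMECC_ANY_V4: return "any";
    case EF_AMDGPU_FEATURE_SRAMECC_OFF_V4: return "off";
    case EF_AMDGPU_FEATURE_SRAMECC_ON_V4:  return "on";
    default:                               return "unsupported"; /* 0x000 */
    }
  }

For example, an ``e_flags`` value of ``0xb3f`` decodes to the processor value
``0x03f`` (``gfx90a``) with XNACK on and SRAMECC off.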
1286
1287Sections
1288--------
1289
1290An AMDGPU target ELF code object has the standard ELF sections which include:
1291
1292  .. table:: AMDGPU ELF Sections
1293     :name: amdgpu-elf-sections-table
1294
1295     ================== ================ =================================
1296     Name               Type             Attributes
1297     ================== ================ =================================
1298     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1299     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1300     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1301     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
1302     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1303     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1304     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1305     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1306     ``.note``          ``SHT_NOTE``     *none*
1307     ``.rela``\ *name*  ``SHT_RELA``     *none*
1308     ``.rela.dyn``      ``SHT_RELA``     *none*
1309     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1310     ``.shstrtab``      ``SHT_STRTAB``   *none*
1311     ``.strtab``        ``SHT_STRTAB``   *none*
1312     ``.symtab``        ``SHT_SYMTAB``   *none*
1313     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1314     ================== ================ =================================
1315
1316These sections have their standard meanings (see [ELF]_) and are only generated
1317if needed.
1318
1319``.debug``\ *\**
1320  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1321  information on the DWARF produced by the AMDGPU backend.
1322
1323``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1324  The standard sections used by a dynamic loader.
1325
1326``.note``
1327  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1328  backend.
1329
1330``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
1333  relocation records associated with the ``.text`` section.
1334
1335  For linked shared code objects, ``.rela.dyn`` contains all the relocation
1336  records from each of the relocatable code object's ``.rela``\ *name* sections.
1337
1338  See :ref:`amdgpu-relocation-records` for the relocation records supported by
1339  the AMDGPU backend.
1340
1341``.text``
1342  The executable machine code for the kernels and functions they call. Generated
1343  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.
1345
1346.. _amdgpu-note-records:
1347
1348Note Records
1349------------
1350
1351The AMDGPU backend code object contains ELF note records in the ``.note``
1352section. The set of generated notes and their semantics depend on the code
1353object version; see :ref:`amdgpu-note-records-v2` and
1354:ref:`amdgpu-note-records-v3-onwards`.
1355
1356As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1357must be generated after the ``name`` field to ensure the ``desc`` field is 4
1358byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
alignment.
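
The padding rule can be made concrete with a little arithmetic. The sketch
below assumes the generic ELF note layout [ELF]_, in which each record starts
with three 4-byte words (``namesz``, ``descsz``, ``type``) followed by the
padded ``name`` and ``desc`` fields. The function names are hypothetical.

.. code:: c

  #include <stdint.h>

  /* Size of the namesz, descsz, and type words (assumed generic ELF layout). */
  enum { NOTE_HEADER_BYTES = 12 };

  /* Rounds n up to the next multiple of 4. */
  static uint32_t align_up4(uint32_t n) {
    return (n + 3u) & ~3u;
  }

  /* Hypothetical helper: byte offset of the desc field from the start of
     the note record. namesz includes the NUL terminator of the name. */
  uint32_t note_desc_offset(uint32_t namesz) {
    return NOTE_HEADER_BYTES + align_up4(namesz);
  }

  /* Hypothetical helper: total size of the note record including the
     zero-byte padding after both the name and the desc fields. */
  uint32_t note_record_size(uint32_t namesz, uint32_t descsz) {
    return note_desc_offset(namesz) + align_up4(descsz);
  }

With the "AMD" vendor name (``namesz`` of 4 including the NUL) the ``desc``
field starts at offset 16 with no padding. With "AMDGPU" (``namesz`` of 7) one
padding byte is added and the ``desc`` field starts at offset 20.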
1362
1363.. _amdgpu-note-records-v2:
1364
1365Code Object V2 Note Records
1366~~~~~~~~~~~~~~~~~~~~~~~~~~~
1367
1368.. warning::
1369  Code object V2 is not the default code object version emitted by
1370  this version of LLVM.
1371
1372The AMDGPU backend code object uses the following ELF note record in the
1373``.note`` section when compiling for code object V2.
1374
1375The note record vendor field is "AMD".
1376
1377Additional note records may be present, but any which are not documented here
1378are deprecated and should not be used.
1379
1380  .. table:: AMDGPU Code Object V2 ELF Note Records
1381     :name: amdgpu-elf-note-records-v2-table
1382
1383     ===== ===================================== ======================================
1384     Name  Type                                  Description
1385     ===== ===================================== ======================================
1386     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1387     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1388                                                 Finalizer and not the LLVM compiler.
1389     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1390     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1391                                                 YAML [YAML]_ textual format.
1392     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1393     ===== ===================================== ======================================
1394
1395..
1396
1397  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1398     :name: amdgpu-elf-note-record-enumeration-values-v2-table
1399
1400     ===================================== =====
1401     Name                                  Value
1402     ===================================== =====
1403     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1404     ``NT_AMD_HSA_HSAIL``                  2
1405     ``NT_AMD_HSA_ISA_VERSION``            3
1406     *reserved*                            4-9
1407     ``NT_AMD_HSA_METADATA``               10
1408     ``NT_AMD_HSA_ISA_NAME``               11
1409     ===================================== =====
1410
1411``NT_AMD_HSA_CODE_OBJECT_VERSION``
1412  Specifies the code object version number. The description field has the
1413  following layout:
1414
  .. code:: c

    struct amdgpu_hsa_note_code_object_version_s {
      uint32_t major_version;  /* At most 2 for code object V2. */
      uint32_t minor_version;
    };
1421
1422  The ``major_version`` has a value less than or equal to 2.
1423
1424``NT_AMD_HSA_HSAIL``
1425  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1426  field has the following layout:
1427
  .. code:: c

    struct amdgpu_hsa_note_hsail_s {
      uint32_t hsail_major_version;
      uint32_t hsail_minor_version;
      uint8_t profile;
      uint8_t machine_model;
      uint8_t default_float_round;
    };
1437
1438``NT_AMD_HSA_ISA_VERSION``
1439  Specifies the target ISA version. The description field has the following layout:
1440
  .. code:: c

    struct amdgpu_hsa_note_isa_s {
      uint16_t vendor_name_size;        /* Includes the NUL character. */
      uint16_t architecture_name_size;  /* Includes the NUL character. */
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
      /* Vendor name, then architecture name, each NUL terminated. */
      char vendor_and_architecture_name[1];
    };
1451
1452  ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1453  vendor and architecture names respectively, including the NUL character.
1454
  ``vendor_and_architecture_name`` contains the NUL terminated string for the
1456  vendor, immediately followed by the NUL terminated string for the
1457  architecture.
1458
1459  This note record is used by the HSA runtime loader.
1460
1461  Code object V2 only supports a limited number of processors and has fixed
1462  settings for target features. See
1463  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1464  processors and the corresponding target ID. In the table the note record ISA
1465  name is a concatenation of the vendor name, architecture name, major, minor,
1466  and stepping separated by a ":".
1467
1468  The target ID column shows the processor name and fixed target features used
1469  by the LLVM compiler. The LLVM compiler does not generate a
1470  ``NT_AMD_HSA_HSAIL`` note record.
1471
1472  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
1475  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1476  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1477  bit.
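
  To illustrate how the note record ISA name relates to the description field,
  the following minimal sketch formats the fields as the ":" separated string
  used in the table. The helper ``format_isa_name`` and the struct
  ``amdgpu_hsa_note_isa_prefix`` are hypothetical names introduced for this
  example, and the sketch assumes a little-endian host so that the fixed-size
  fields can be copied directly.

  .. code:: c

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Fixed-size leading fields of amdgpu_hsa_note_isa_s (16 bytes). The
       vendor and architecture names follow immediately after them. */
    struct amdgpu_hsa_note_isa_prefix {
      uint16_t vendor_name_size;
      uint16_t architecture_name_size;
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
    };

    /* Hypothetical helper: writes "vendor:arch:major:minor:stepping" to out.
       Returns 0 on success, -1 if desc is malformed or out is too small. */
    int format_isa_name(const uint8_t *desc, size_t desc_size,
                        char *out, size_t out_size) {
      struct amdgpu_hsa_note_isa_prefix p;
      if (desc_size < sizeof p)
        return -1;
      memcpy(&p, desc, sizeof p); /* Assumes a little-endian host. */

      /* Both sizes include the NUL character, so neither may be zero. */
      if (p.vendor_name_size == 0 || p.architecture_name_size == 0)
        return -1;
      size_t names = (size_t)p.vendor_name_size + p.architecture_name_size;
      if (desc_size - sizeof p < names)
        return -1;

      const char *vendor = (const char *)desc + sizeof p;
      const char *arch = vendor + p.vendor_name_size;
      /* Reject names that are not NUL terminated within their sizes. */
      if (vendor[p.vendor_name_size - 1] != '\0' ||
          arch[p.architecture_name_size - 1] != '\0')
        return -1;

      int n = snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch,
                       (unsigned)p.major, (unsigned)p.minor,
                       (unsigned)p.stepping);
      return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
    }

  A description holding the vendor "AMD", the architecture "AMDGPU", and the
  version 9, 0, 6 is formatted as ``AMD:AMDGPU:9:0:6``.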
1478
1479``NT_AMD_HSA_ISA_NAME``
1480  Specifies the target ISA name as a non-NUL terminated string.
1481
1482  This note record is not used by the HSA runtime loader.
1483
1484  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1485  V2's limited support of processors and fixed settings for target features.
1486
1487  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1488  from the string to the corresponding target ID. If the ``xnack`` target
1489  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not do the appending and
1491  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1492
1493``NT_AMD_HSA_METADATA``
1494  Specifies extensible metadata associated with the code objects executed on HSA
1495  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1496  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1497  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1498  metadata string.
1499
1500  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1501     :name: amdgpu-elf-note-record-supported_processors-v2-table
1502
1503     ===================== ==========================
1504     Note Record ISA Name  Target ID
1505     ===================== ==========================
1506     ``AMD:AMDGPU:6:0:0``  ``gfx600``
1507     ``AMD:AMDGPU:6:0:1``  ``gfx601``
1508     ``AMD:AMDGPU:6:0:2``  ``gfx602``
1509     ``AMD:AMDGPU:7:0:0``  ``gfx700``
1510     ``AMD:AMDGPU:7:0:1``  ``gfx701``
1511     ``AMD:AMDGPU:7:0:2``  ``gfx702``
1512     ``AMD:AMDGPU:7:0:3``  ``gfx703``
1513     ``AMD:AMDGPU:7:0:4``  ``gfx704``
1514     ``AMD:AMDGPU:7:0:5``  ``gfx705``
1515     ``AMD:AMDGPU:8:0:0``  ``gfx802``
1516     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1517     ``AMD:AMDGPU:8:0:2``  ``gfx802``
1518     ``AMD:AMDGPU:8:0:3``  ``gfx803``
1519     ``AMD:AMDGPU:8:0:4``  ``gfx803``
1520     ``AMD:AMDGPU:8:0:5``  ``gfx805``
1521     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1522     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1523     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1524     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1525     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1526     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1527     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1528     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1529     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1530     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1531     ===================== ==========================
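
As an illustration of the mapping in the table above, the following Python
sketch looks up the target ID for a code object V2 ISA name. It copies only a
few rows of the table, ignores the optional ``+xnack`` suffix, and the function
name ``target_id_for_isa_name`` is hypothetical rather than part of any LLVM or
runtime API.

.. code:: python

  # A few rows copied from the table above; a real consumer needs all of them.
  V2_ISA_NAME_TO_TARGET_ID = {
      "AMD:AMDGPU:7:0:1": "gfx701",
      "AMD:AMDGPU:8:0:3": "gfx803",
      "AMD:AMDGPU:9:0:0": "gfx900:xnack-",
      "AMD:AMDGPU:9:0:1": "gfx900:xnack+",
      "AMD:AMDGPU:9:0:6": "gfx906:sramecc-:xnack-",
  }


  def target_id_for_isa_name(desc: bytes) -> str:
      """Map an NT_AMD_HSA_ISA_NAME descriptor to a target ID."""
      # The string is not NUL terminated, but tolerate trailing NUL padding.
      name = desc.rstrip(b"\0").decode("ascii")
      return V2_ISA_NAME_TO_TARGET_ID[name]

For example, ``target_id_for_isa_name(b"AMD:AMDGPU:9:0:1")`` selects the
``gfx900`` row with ``xnack`` enabled.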
1532
1533.. _amdgpu-note-records-v3-onwards:
1534
1535Code Object V3 and Above Note Records
1536~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1537
1538The AMDGPU backend code object uses the following ELF note record in the
1539``.note`` section when compiling for code object V3 and above.
1540
1541The note record vendor field is "AMDGPU".
1542
1543Additional note records may be present, but any which are not documented here
1544are deprecated and should not be used.
1545
1546  .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1547     :name: amdgpu-elf-note-records-table-v3-onwards
1548
1549     ======== ============================== ======================================
1550     Name     Type                           Description
1551     ======== ============================== ======================================
1552     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1553                                             binary format.
1554     ======== ============================== ======================================
1555
1556..
1557
1558  .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1559     :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1560
1561     ============================== =====
1562     Name                           Value
1563     ============================== =====
1564     *reserved*                     0-31
1565     ``NT_AMDGPU_METADATA``         32
1566     ============================== =====
1567
1568``NT_AMDGPU_METADATA``
1569  Specifies extensible metadata associated with an AMDGPU code object. It is
1570  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1571  :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1572  :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
1573  :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
1574  ``amdhsa`` OS.
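
A minimal sketch of how a tool might locate this note record is shown below. It
assumes the generic little-endian ELF note layout (three 4-byte words holding
the name size, descriptor size and type, followed by the name and descriptor,
each padded to a 4-byte boundary). The helper name ``find_amdgpu_metadata`` is
hypothetical, and decoding the returned Message Pack bytes is left to a
MsgPack library.

.. code:: python

  import struct

  NT_AMDGPU_METADATA = 32  # value from the enumeration table above


  def _align4(n: int) -> int:
      return (n + 3) & ~3


  def find_amdgpu_metadata(note_section: bytes):
      """Return the raw MsgPack bytes of NT_AMDGPU_METADATA, or None."""
      offset = 0
      while offset + 12 <= len(note_section):
          namesz, descsz, note_type = struct.unpack_from(
              "<III", note_section, offset)
          offset += 12
          # The name field includes its terminating NUL.
          name = note_section[offset:offset + namesz].rstrip(b"\0")
          offset += _align4(namesz)
          desc = note_section[offset:offset + descsz]
          offset += _align4(descsz)
          if name == b"AMDGPU" and note_type == NT_AMDGPU_METADATA:
              return desc
      return None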
1575
1576.. _amdgpu-symbols:
1577
1578Symbols
1579-------
1580
1581Symbols include the following:
1582
1583  .. table:: AMDGPU ELF Symbols
1584     :name: amdgpu-elf-symbols-table
1585
1586     ===================== ================== ================ ==================
1587     Name                  Type               Section          Description
1588     ===================== ================== ================ ==================
1589     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1590                                              - ``.rodata``
1591                                              - ``.bss``
1592     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1593     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1594     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1595     ===================== ================== ================ ==================
1596
1597Global variable
1598  Global variables both used and defined by the compilation unit.
1599
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
1602
  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1604  will resolve relocations using the definition provided by another code object
1605  or explicitly defined by the runtime.
1606
1607  If the symbol resides in local/group memory (LDS) then its section is the
  special processor-specific section index ``SHN_AMDGPU_LDS``, and the
1609  ``st_value`` field describes alignment requirements as it does for common
1610  symbols.
1611
1612  .. TODO::
1613
1614     Add description of linked shared object symbols. Seems undefined symbols
1615     are marked as STT_NOTYPE.
1616
1617Kernel descriptor
1618  Every HSA kernel has an associated kernel descriptor. It is the address of the
1619  kernel descriptor that is used in the AQL dispatch packet used to invoke the
1620  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1621  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1622
1623Kernel entry point
1624  Every HSA kernel also has a symbol for its machine code entry point.
1625
1626.. _amdgpu-relocation-records:
1627
1628Relocation Records
1629------------------
1630
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1632relocatable fields are:
1633
1634``word32``
1635  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1636  alignment. These values use the same byte order as other word values in the
1637  AMDGPU architecture.
1638
1639``word64``
1640  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1641  alignment. These values use the same byte order as other word values in the
1642  AMDGPU architecture.
1643
The following notation is used for specifying relocation calculations:
1645
1646**A**
1647  Represents the addend used to compute the value of the relocatable field.
1648
1649**G**
1650  Represents the offset into the global offset table at which the relocation
1651  entry's symbol will reside during execution.
1652
1653**GOT**
1654  Represents the address of the global offset table.
1655
1656**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1658  of the storage unit being relocated (computed using ``r_offset``).
1659
1660**S**
1661  Represents the value of the symbol whose index resides in the relocation
1662  entry. Relocations not using this must specify a symbol index of
1663  ``STN_UNDEF``.
1664
1665**B**
1666  Represents the base address of a loaded executable or shared object which is
1667  the difference between the ELF address and the actual load address.
1668  Relocations using this are only valid in executable or shared objects.
1669
1670The following relocation types are supported:
1671
1672  .. table:: AMDGPU ELF Relocation Records
1673     :name: amdgpu-elf-relocation-records-table
1674
1675     ========================== ======= =====  ==========  ==============================
1676     Relocation Type            Kind    Value  Field       Calculation
1677     ========================== ======= =====  ==========  ==============================
1678     ``R_AMDGPU_NONE``                  0      *none*      *none*
1679     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1680                                Dynamic
1681     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1682                                Dynamic
1683     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1684                                Dynamic
1685     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1686     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1687     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1688                                Dynamic
1689     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1690     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1691     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1692     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1693     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1694     *reserved*                         12
1695     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1696     ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1697     ========================== ======= =====  ==========  ==============================
1698
1699``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1700the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1701
1702There is no current OS loader support for 32-bit programs and so
1703``R_AMDGPU_ABS32`` is not used.
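
The calculations in the relocation table can be sketched in Python as follows.
This is only an illustration: the function name ``relocate`` is hypothetical,
intermediate values are treated as 64-bit two's complement, and truncating the
result to the field width (instead of diagnosing overflow) is an assumption of
the sketch.

.. code:: python

  MASK16 = 0xFFFF
  MASK32 = 0xFFFFFFFF
  MASK64 = 0xFFFFFFFFFFFFFFFF


  def relocate(rel_type, S=0, A=0, P=0, G=0, GOT=0, B=0):
      """Return the unsigned value stored in the relocatable field."""
      abs64 = (S + A) & MASK64
      rel64 = (S + A - P) & MASK64
      got64 = (G + GOT + A - P) & MASK64
      calculations = {
          "R_AMDGPU_ABS32_LO": abs64 & MASK32,
          "R_AMDGPU_ABS32_HI": abs64 >> 32,
          "R_AMDGPU_ABS64": abs64,
          "R_AMDGPU_REL32": rel64 & MASK32,
          "R_AMDGPU_REL64": rel64,
          "R_AMDGPU_ABS32": abs64 & MASK32,
          "R_AMDGPU_GOTPCREL": got64 & MASK32,
          "R_AMDGPU_GOTPCREL32_LO": got64 & MASK32,
          "R_AMDGPU_GOTPCREL32_HI": got64 >> 32,
          "R_AMDGPU_REL32_LO": rel64 & MASK32,
          "R_AMDGPU_REL32_HI": rel64 >> 32,
          "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
          # Dword displacement relative to the end of a 4-byte instruction.
          "R_AMDGPU_REL16": (((S + A - P) - 4) // 4) & MASK16,
      }
      return calculations[rel_type]

For instance, a symbol at ``0x110`` referenced from place ``0x100`` with a zero
addend gives an ``R_AMDGPU_REL16`` field value of ``3``.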
1704
1705.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1706
1707Loaded Code Object Path Uniform Resource Identifier (URI)
1708---------------------------------------------------------
1709
1710The AMD GPU code object loader represents the path of the ELF shared object from
1711which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded and relocated form of the
ELF shared object.  Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.
1715
1716The loaded code object path URI syntax is defined by the following BNF syntax:
1717
1718.. code::
1719
1720  code_object_uri ::== file_uri | memory_uri
1721  file_uri        ::== "file://" file_path [ range_specifier ]
1722  memory_uri      ::== "memory://" process_id range_specifier
1723  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1724  file_path       ::== URI_ENCODED_OS_FILE_PATH
1725  process_id      ::== DECIMAL_NUMBER
1726  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1727
1728**number**
1729  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1730  and octal values by "0".
1731
1732**file_path**
1733  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1734  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
1736  the path are separated by "/".
1737
1738**offset**
1739  Is a 0-based byte offset to the start of the code object.  For a file URI, it
1740  is from the start of the file specified by the ``file_path``, and if omitted
1741  defaults to 0. For a memory URI, it is the memory address and is required.
1742
1743**size**
1744  Is the number of bytes in the code object.  For a file URI, if omitted it
1745  defaults to the size of the file.  It is required for a memory URI.
1746
1747**process_id**
1748  Is the identity of the process owning the memory.  For Linux it is the C
1749  unsigned integral decimal literal for the process ID (PID).
1750
1751For example:
1752
1753.. code::
1754
1755  file:///dir1/dir2/file1
1756  file:///dir3/dir4/file2#offset=0x2000&size=3000
1757  memory://1234#offset=0x20000&size=3000
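
The grammar above can be exercised with a small Python sketch such as the one
below. The function names and the shape of the returned dictionary are
hypothetical choices made for illustration, and a ``size`` of ``None`` is used
here to stand for "the size of the file" when a file URI has no range
specifier.

.. code:: python

  import re
  from urllib.parse import quote, unquote

  _RANGE = r"[#?]offset=(?P<offset>[0-9A-Za-z]+)&size=(?P<size>[0-9A-Za-z]+)"
  _FILE_URI = re.compile(r"file://(?P<path>[^#?]*)(?:" + _RANGE + r")?")
  _MEMORY_URI = re.compile(r"memory://(?P<pid>[0-9]+)" + _RANGE)


  def parse_number(text):
      """Parse a C integral literal: 0x/0X is hex, a leading 0 is octal."""
      if text[:2].lower() == "0x":
          return int(text, 16)
      if len(text) > 1 and text[0] == "0":
          return int(text, 8)
      return int(text, 10)


  def parse_code_object_uri(uri):
      match = _FILE_URI.fullmatch(uri)
      if match:
          result = {"kind": "file", "path": unquote(match["path"])}
      else:
          match = _MEMORY_URI.fullmatch(uri)
          if not match:
              raise ValueError("not a loaded code object URI: " + uri)
          result = {"kind": "memory", "process_id": int(match["pid"])}
      offset, size = match["offset"], match["size"]
      result["offset"] = parse_number(offset) if offset else 0
      result["size"] = parse_number(size) if size else None
      return result


  def encode_file_path(path):
      """URI encode a path, leaving only [a-zA-Z0-9/_.~-] unescaped."""
      return quote(path, safe="/")

Applied to the second example above, ``parse_code_object_uri`` yields the path
``/dir3/dir4/file2`` with offset ``0x2000`` and size ``3000``.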
1758
1759.. _amdgpu-dwarf-debug-information:
1760
1761DWARF Debug Information
1762=======================
1763
1764.. warning::
1765
1766   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1767   is not currently fully implemented and is subject to change.
1768
1769AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1770:ref:`amdgpu-elf-code-object`) which contain information that maps the code
1771object executable code and data to the source language constructs. It can be
1772used by tools such as debuggers and profilers. It uses features defined in
1773:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1774DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1775
1776This section defines the AMDGPU target architecture specific DWARF mappings.
1777
1778.. _amdgpu-dwarf-register-identifier:
1779
1780Register Identifier
1781-------------------
1782
1783This section defines the AMDGPU target architecture register numbers used in
1784DWARF operation expressions (see DWARF Version 5 section 2.5 and
1785:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1786instructions (see DWARF Version 5 section 6.4 and
1787:ref:`amdgpu-dwarf-call-frame-information`).
1788
1789A single code object can contain code for kernels that have different wavefront
1790sizes. The vector registers and some scalar registers are based on the wavefront
1791size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1792simplifies the consumer of the DWARF so that each register has a fixed size,
1793rather than being dynamic according to the wavefront size mode. Similarly,
1794distinct DWARF registers are defined for those registers that vary in size
1795according to the process address size. This allows a consumer to treat a
1796specific AMDGPU processor as a single architecture regardless of how it is
1797configured at run time. The compiler explicitly specifies the DWARF registers
1798that match the mode in which the code it is generating will be executed.
1799
1800DWARF registers are encoded as numbers, which are mapped to architecture
1801registers. The mapping for AMDGPU is defined in
1802:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1803mapping.
1804
1805.. table:: AMDGPU DWARF Register Mapping
1806   :name: amdgpu-dwarf-register-mapping-table
1807
1808   ============== ================= ======== ==================================
1809   DWARF Register AMDGPU Register   Bit Size Description
1810   ============== ================= ======== ==================================
1811   0              PC_32             32       Program Counter (PC) when
1812                                             executing in a 32-bit process
1813                                             address space. Used in the CFI to
1814                                             describe the PC of the calling
1815                                             frame.
1816   1              EXEC_MASK_32      32       Execution Mask Register when
1817                                             executing in wavefront 32 mode.
1818   2-15           *Reserved*                 *Reserved for highly accessed
1819                                             registers using DWARF shortcut.*
1820   16             PC_64             64       Program Counter (PC) when
1821                                             executing in a 64-bit process
1822                                             address space. Used in the CFI to
1823                                             describe the PC of the calling
1824                                             frame.
1825   17             EXEC_MASK_64      64       Execution Mask Register when
1826                                             executing in wavefront 64 mode.
1827   18-31          *Reserved*                 *Reserved for highly accessed
1828                                             registers using DWARF shortcut.*
1829   32-95          SGPR0-SGPR63      32       Scalar General Purpose
1830                                             Registers.
1831   96-127         *Reserved*                 *Reserved for frequently accessed
1832                                             registers using DWARF 1-byte ULEB.*
1833   128            STATUS            32       Status Register.
1834   129-511        *Reserved*                 *Reserved for future Scalar
1835                                             Architectural Registers.*
1836   512            VCC_32            32       Vector Condition Code Register
1837                                             when executing in wavefront 32
1838                                             mode.
1839   513-767        *Reserved*                 *Reserved for future Vector
1840                                             Architectural Registers when
1841                                             executing in wavefront 32 mode.*
1842   768            VCC_64            64       Vector Condition Code Register
1843                                             when executing in wavefront 64
1844                                             mode.
1845   769-1023       *Reserved*                 *Reserved for future Vector
1846                                             Architectural Registers when
1847                                             executing in wavefront 64 mode.*
1848   1024-1087      *Reserved*                 *Reserved for padding.*
1849   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1850   1130-1535      *Reserved*                 *Reserved for future Scalar
1851                                             General Purpose Registers.*
1852   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1853                                             when executing in wavefront 32
1854                                             mode.
1855   1792-2047      *Reserved*                 *Reserved for future Vector
1856                                             General Purpose Registers when
1857                                             executing in wavefront 32 mode.*
1858   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1859                                             when executing in wavefront 32
1860                                             mode.
1861   2304-2559      *Reserved*                 *Reserved for future Vector
1862                                             Accumulation Registers when
1863                                             executing in wavefront 32 mode.*
1864   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1865                                             when executing in wavefront 64
1866                                             mode.
1867   2816-3071      *Reserved*                 *Reserved for future Vector
1868                                             General Purpose Registers when
1869                                             executing in wavefront 64 mode.*
1870   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1871                                             when executing in wavefront 64
1872                                             mode.
1873   3328-3583      *Reserved*                 *Reserved for future Vector
1874                                             Accumulation Registers when
1875                                             executing in wavefront 64 mode.*
1876   ============== ================= ======== ==================================
1877
1878The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
1880the least significant bit position corresponding to lane 0 and so forth. DWARF
1881location expressions involving the ``DW_OP_LLVM_offset`` and
1882``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1883register corresponding to the lane that is executing the current thread of
1884execution in languages that are implemented using a SIMD or SIMT execution
1885model.
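
To make the lane layout concrete, the following Python sketch models a vector
register as one wide unsigned integer and extracts the dword of a given lane.
The function ``lane_dword`` is purely illustrative; a DWARF consumer would
obtain the same result by evaluating the location expression rather than by
calling a helper like this.

.. code:: python

  def lane_dword(register_value: int, lane: int, wavefront_size: int) -> int:
      """Return the 32-bit value held by `lane` in a full vector register."""
      if not 0 <= lane < wavefront_size:
          raise ValueError("lane out of range for this wavefront size")
      # Lane 0 occupies the least significant 32 bits, lane 1 the next, etc.
      return (register_value >> (32 * lane)) & 0xFFFFFFFF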
1886
1887If the wavefront size is 32 lanes then the wavefront 32 mode register
1888definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1889mode register definitions are used. Some AMDGPU targets support executing in
1890both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1891to the wavefront mode of the generated code will be used.
1892
1893If code is generated to execute in a 32-bit process address space, then the
189432-bit process address space register definitions are used. If code is generated
1895to execute in a 64-bit process address space, then the 64-bit process address
1896space register definitions are used. The ``amdgcn`` target only supports the
189764-bit process address space.
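
The register mapping table and the mode selection rules above can be summarized
by a sketch like the following. The function name ``dwarf_register`` and its
parameters are hypothetical; only the numeric ranges come from
:ref:`amdgpu-dwarf-register-mapping-table`.

.. code:: python

  def dwarf_register(name, index=0, wavefront_size=64, address_bits=64):
      """Return the DWARF register number for an AMDGPU register."""
      if wavefront_size not in (32, 64) or address_bits not in (32, 64):
          raise ValueError("unsupported wavefront size or address size")
      wave64 = wavefront_size == 64
      if name == "PC":
          return 16 if address_bits == 64 else 0
      if name == "EXEC_MASK":
          return 17 if wave64 else 1
      if name == "STATUS":
          return 128
      if name == "VCC":
          return 768 if wave64 else 512
      if name == "SGPR" and 0 <= index <= 63:
          return 32 + index
      if name == "SGPR" and 64 <= index <= 105:
          return 1088 + (index - 64)
      if name == "VGPR" and 0 <= index <= 255:
          return (2560 if wave64 else 1536) + index
      if name == "AGPR" and 0 <= index <= 255:
          return (3072 if wave64 else 2048) + index
      raise ValueError("no DWARF register for %s%d" % (name, index))

Note how ``SGPR64`` and above jump to a separate range starting at 1088, while
the vector registers pick their base from the wavefront size.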
1898
1899.. _amdgpu-dwarf-address-class-identifier:
1900
1901Address Class Identifier
1902------------------------
1903
1904The DWARF address class represents the source language memory space. See DWARF
1905Version 5 section 2.12 which is updated by the *DWARF Extensions For
1906Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1907
1908The DWARF address class mapping used for AMDGPU is defined in
1909:ref:`amdgpu-dwarf-address-class-mapping-table`.
1910
1911.. table:: AMDGPU DWARF Address Class Mapping
1912   :name: amdgpu-dwarf-address-class-mapping-table
1913
1914   ========================= ====== =================
1915   DWARF                            AMDGPU
1916   -------------------------------- -----------------
1917   Address Class Name        Value  Address Space
1918   ========================= ====== =================
1919   ``DW_ADDR_none``          0x0000 Generic (Flat)
1920   ``DW_ADDR_LLVM_global``   0x0001 Global
1921   ``DW_ADDR_LLVM_constant`` 0x0002 Global
1922   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
1923   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
1924   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
1925   ========================= ====== =================
1926
1927The DWARF address class values defined in the *DWARF Extensions For
1928Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
1929
In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
available to the AMD extension for accessing the hardware GDS memory, which is
scratchpad memory allocated per device.
1933
1934For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1935address class of ``DW_ADDR_none`` is used.
1936
1937See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1938mapping of DWARF address classes to DWARF address spaces, including address size
1939and NULL value.
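
As a small illustration, the address class table and the default rule can be
written as a lookup. The dictionary and the function name
``amdgpu_address_space`` are hypothetical; passing ``None`` models a debugger
information entry that has no ``DW_AT_address_class`` attribute.

.. code:: python

  DW_ADDR_TO_AMDGPU_ADDRESS_SPACE = {
      0x0000: ("DW_ADDR_none", "Generic (Flat)"),
      0x0001: ("DW_ADDR_LLVM_global", "Global"),
      0x0002: ("DW_ADDR_LLVM_constant", "Global"),
      0x0003: ("DW_ADDR_LLVM_group", "Local (group/LDS)"),
      0x0004: ("DW_ADDR_LLVM_private", "Private (Scratch)"),
      0x8000: ("DW_ADDR_AMDGPU_region", "Region (GDS)"),
  }


  def amdgpu_address_space(address_class=None):
      """Return the AMDGPU address space for a DWARF address class value."""
      if address_class is None:
          address_class = 0x0000  # default is DW_ADDR_none
      return DW_ADDR_TO_AMDGPU_ADDRESS_SPACE[address_class][1]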
1940
1941.. _amdgpu-dwarf-address-space-identifier:
1942
1943Address Space Identifier
1944------------------------
1945
1946DWARF address spaces correspond to target architecture specific linear
1947addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1948For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1949
1950The DWARF address space mapping used for AMDGPU is defined in
1951:ref:`amdgpu-dwarf-address-space-mapping-table`.
1952
1953.. table:: AMDGPU DWARF Address Space Mapping
1954   :name: amdgpu-dwarf-address-space-mapping-table
1955
1956   ======================================= ===== ======= ======== ================= =======================
1957   DWARF                                                          AMDGPU            Notes
1958   --------------------------------------- ----- ---------------- ----------------- -----------------------
1959   Address Space Name                      Value Address Bit Size Address Space
1960   --------------------------------------- ----- ------- -------- ----------------- -----------------------
1961   ..                                            64-bit  32-bit
1962                                                 process process
1963                                                 address address
1964                                                 space   space
1965   ======================================= ===== ======= ======== ================= =======================
1966   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
1967   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
1968   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
1969   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
1970   *Reserved*                              0x04
1971   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
1972   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
1973   ======================================= ===== ======= ======== ================= =======================
1974
1975See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1976including address size and NULL value.
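
The address sizes in the table above can be looked up as in the following
sketch, which a consumer might use to size an address read from a location
expression. The names are hypothetical, and the reserved value ``0x04`` is
deliberately left out so that it raises ``KeyError``.

.. code:: python

  # value: (name, bits in 64-bit process, bits in 32-bit process, AMDGPU space)
  DW_ASPACE_TABLE = {
      0x00: ("DW_ASPACE_none", 64, 32, "Global"),
      0x01: ("DW_ASPACE_AMDGPU_generic", 64, 32, "Generic (Flat)"),
      0x02: ("DW_ASPACE_AMDGPU_region", 32, 32, "Region (GDS)"),
      0x03: ("DW_ASPACE_AMDGPU_local", 32, 32, "Local (group/LDS)"),
      0x05: ("DW_ASPACE_AMDGPU_private_lane", 32, 32, "Private (Scratch)"),
      0x06: ("DW_ASPACE_AMDGPU_private_wave", 32, 32, "Private (Scratch)"),
  }


  def address_bit_size(aspace: int, process_address_bits: int = 64) -> int:
      """Return the address size in bits for a DWARF address space value."""
      _, bits64, bits32, _ = DW_ASPACE_TABLE[aspace]
      return bits64 if process_address_bits == 64 else bits32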
1977
1978The ``DW_ASPACE_none`` address space is the default target architecture address
1979space used in DWARF operations that do not specify an address space. It
1980therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1981related operations can refer to addresses in the program code.
1982
1983The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1984specify the flat address space. If the address corresponds to an address in the
1985local address space, then it corresponds to the wavefront that is executing the
1986focused thread of execution. If the address corresponds to an address in the
1987private address space, then it corresponds to the lane that is executing the
1988focused thread of execution for languages that are implemented using a SIMD or
1989SIMT execution model.
1990
1991.. note::
1992
1993  CUDA-like languages such as HIP that do not have address spaces in the
1994  language type system, but do allow variables to be allocated in different
1995  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1996  address space in the DWARF expression operations as the default address space
1997  is the global address space.
1998
1999The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2000specify the local address space corresponding to the wavefront that is executing
2001the focused thread of execution.
2002
2003The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2004to specify the private address space corresponding to the lane that is executing
2005the focused thread of execution for languages that are implemented using a SIMD
2006or SIMT execution model.
2007
2008The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2009to specify the unswizzled private address space corresponding to the wavefront
2010that is executing the focused thread of execution. The wavefront view of private
2011memory is the per wavefront unswizzled backing memory layout defined in
2012:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2013location for the backing memory of the wavefront (namely the address is not
2014offset by ``wavefront-scratch-base``). The following formula can be used to
2015convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2016``DW_ASPACE_AMDGPU_private_wave`` address:
2017
2018::
2019
2020  private-address-wavefront =
2021    ((private-address-lane / 4) * wavefront-size * 4) +
2022    (wavefront-lane-id * 4) + (private-address-lane % 4)
2023
2024If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
2025of the dwords for each lane starting with lane 0 is required, then this
2026simplifies to:
2027
2028::
2029
2030  private-address-wavefront =
2031    private-address-lane * wavefront-size
2032
2033A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2034complete spilled vector register back into a complete vector register in the
2035CFI. The frame pointer can be a private lane address which is dword aligned,
2036which can be shifted to multiply by the wavefront size, and then used to form a
2037private wavefront address that gives a location for a contiguous set of dwords,
2038one per lane, where the vector register dwords are spilled. The compiler knows
2039the wavefront size since it generates the code. Note that the type of the
2040address may have to be converted as the size of a
2041``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2042``DW_ASPACE_AMDGPU_private_wave`` address.
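
The conversion formulas above can be checked with a short Python sketch. The
function name ``private_lane_to_wave`` is hypothetical; its parameters mirror
the terms of the formula.

.. code:: python

  def private_lane_to_wave(private_address_lane: int,
                           wavefront_size: int,
                           wavefront_lane_id: int) -> int:
      """Convert a private_lane address to a private_wave address."""
      dword_index = private_address_lane // 4
      byte_in_dword = private_address_lane % 4
      return (dword_index * wavefront_size * 4
              + wavefront_lane_id * 4
              + byte_in_dword)


  # For a dword-aligned address and lane 0 the result reduces to a multiply,
  # which for a wavefront size of 64 is the same as a left shift by 6.
  frame_pointer = 0x40
  assert private_lane_to_wave(frame_pointer, 64, 0) == frame_pointer * 64
  assert private_lane_to_wave(frame_pointer, 64, 0) == frame_pointer << 6

The general form keeps the byte within the dword and steps over the dwords of
the lower-numbered lanes, so byte 5 of lane 3 in a 64-lane wavefront lands at
wavefront address 269.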
2043
2044.. _amdgpu-dwarf-lane-identifier:
2045
Lane Identifier
2047---------------
2048
DWARF lane identifiers specify a target architecture lane position for hardware
2050that executes in a SIMD or SIMT manner, and on which a source language maps its
2051threads of execution onto those lanes. The DWARF lane identifier is pushed by
2052the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2053section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2054section :ref:`amdgpu-dwarf-operation-expressions`.
2055
2056For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2057wavefront. It is numbered from 0 to the wavefront size minus 1.
2058
2059Operation Expressions
2060---------------------
2061
2062DWARF expressions are used to compute program values and the locations of
2063program objects. See DWARF Version 5 section 2.5 and
2064:ref:`amdgpu-dwarf-operation-expressions`.
2065
2066DWARF location descriptions describe how to access storage which includes memory
2067and registers. When accessing storage on AMDGPU, bytes are ordered with least
2068significant bytes first, and bits are ordered within bytes with least
2069significant bits first.
2070
2071For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2072unwinding vector registers that are spilled under the execution mask to memory:
2073the zero-single location description is the vector register, and the one-single
2074location description is the spilled memory location description. The
2075``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2076memory location description.
2077
2078In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2079``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2080controlled by the execution mask. An undefined location description together
2081with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2082to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2083
2084Debugger Information Entry Attributes
2085-------------------------------------
2086
2087This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1,
2089which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2090:ref:`amdgpu-dwarf-low-level-information` and
2091:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2092
2093.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2094
2095``DW_AT_LLVM_lane_pc``
2096~~~~~~~~~~~~~~~~~~~~~~
2097
2098For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2099location of the separate lanes of a SIMT thread.
2100
2101If the lane is an active lane then this will be the same as the current program
2102location.
2103
2104If the lane is inactive, but was active on entry to the subprogram, then this is
2105the program location in the subprogram at which execution of the lane is
conceptually positioned.
2107
2108If the lane was not active on entry to the subprogram, then this will be the
2109undefined location. A client debugger can check if the lane is part of a valid
2110work-group by checking that the lane is in the range of the associated
2111work-group within the grid, accounting for partial work-groups. If it is not,
2112then the debugger can omit any information for the lane. Otherwise, the debugger
2113may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2114calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
2116``DW_AT_LLVM_lane_pc``.
2117
2118The following example illustrates how the AMDGPU backend can generate a DWARF
2119location list expression for the nested ``IF/THEN/ELSE`` structures of the
2120following subprogram pseudo code for a target with 64 lanes per wavefront.
2121
2122.. code::
2123  :number-lines:
2124
2125  SUBPROGRAM X
2126  BEGIN
2127    a;
2128    IF (c1) THEN
2129      b;
2130      IF (c2) THEN
2131        c;
2132      ELSE
2133        d;
2134      ENDIF
2135      e;
2136    ELSE
2137      f;
2138    ENDIF
2139    g;
2140  END
2141
2142The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2143execution mask (``EXEC``) to linearize the control flow. The condition is
2144evaluated to make a mask of the lanes for which the condition evaluates to true.
2145First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
value it had at the beginning of the region. This is shown below. Other
approaches are possible, but the basic concept is the same.
2152
2153.. code::
2154  :number-lines:
2155
2156  $lex_start:
2157    a;
2158    %1 = EXEC
2159    %2 = c1
2160  $lex_1_start:
2161    EXEC = %1 & %2
  $lex_1_then:
2163      b;
2164      %3 = EXEC
2165      %4 = c2
2166  $lex_1_1_start:
2167      EXEC = %3 & %4
2168  $lex_1_1_then:
2169        c;
2170      EXEC = ~EXEC & %3
2171  $lex_1_1_else:
2172        d;
2173      EXEC = %3
2174  $lex_1_1_end:
2175      e;
2176    EXEC = ~EXEC & %1
2177  $lex_1_else:
2178      f;
2179    EXEC = %1
2180  $lex_1_end:
2181    g;
2182  $lex_end:
2183
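The effect of this linearization on the ``EXEC`` mask can be checked with a
small simulation. The Python sketch below is not part of the backend:
``run_linearized`` is a hypothetical helper that represents masks as integers,
takes the condition masks ``c1`` and ``c2`` as inputs, and records which lanes
execute each statement.

.. code:: python

  from typing import Dict


  def run_linearized(c1: int, c2: int, entry_exec: int,
                     lanes: int = 64) -> Dict[str, int]:
      """Return, for each statement a..g, the mask of lanes that execute it."""
      full = (1 << lanes) - 1
      executed = {}
      exec_mask = entry_exec & full
      executed["a"] = exec_mask
      saved_1 = exec_mask                # %1 = EXEC
      exec_mask = saved_1 & c1           # EXEC = %1 & %2 (outer THEN)
      executed["b"] = exec_mask
      saved_3 = exec_mask                # %3 = EXEC
      exec_mask = saved_3 & c2           # EXEC = %3 & %4 (inner THEN)
      executed["c"] = exec_mask
      exec_mask = ~exec_mask & saved_3   # EXEC = ~EXEC & %3 (inner ELSE)
      executed["d"] = exec_mask
      exec_mask = saved_3                # EXEC = %3 (inner region ends)
      executed["e"] = exec_mask
      exec_mask = ~exec_mask & saved_1   # EXEC = ~EXEC & %1 (outer ELSE)
      executed["f"] = exec_mask
      exec_mask = saved_1                # EXEC = %1 (outer region ends)
      executed["g"] = exec_mask
      return executed

With four lanes, ``run_linearized(0b0011, 0b0101, 0b1111, lanes=4)`` reports
that ``c`` runs on lane 0, ``d`` on lane 1, ``f`` on lanes 2 and 3, and ``g`` on
all four lanes.
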
2184To create the DWARF location list expression that defines the location
2185description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2186pseudo instruction can be used to annotate the linearized control flow. This can
2187be done by defining an artificial variable for the lane PC. The DWARF location
2188list expression created for it is used as the value of the
2189``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2190
2191A DWARF procedure is defined for each well nested structured control flow region
2192which provides the conceptual lane program location for a lane if it is not
2193active (namely it is divergent). The DWARF operation expression for each region
2194conceptually inherits the value of the immediately enclosing region and modifies
2195it according to the semantics of the region.
2196
2197For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2198the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2199region the divergent program location is at the end of the ``IF/THEN/ELSE``
2200region since the ``THEN`` region has completed.
2201
2202The lane PC artificial variable is assigned at each region transition. It uses
2203the immediately enclosing region's DWARF procedure to compute the program
2204location for each lane assuming they are divergent, and then modifies the result
2205by inserting the current program location for each lane that the ``EXEC`` mask
2206indicates is active.
2207
2208By having separate DWARF procedures for each region, they can be reused to
2209define the value for any nested region. This reduces the total size of the DWARF
2210operation expressions.
2211
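The composition described above can be modeled on plain Python lists. The
sketch below only loosely mirrors the select-by-mask step and does not evaluate
any DWARF operations; ``select_by_mask``, ``divergent_lane_pc`` and ``lane_pc``
are hypothetical helper names, and ``None`` stands in for the undefined
location.

.. code:: python

  from typing import List, Optional

  UNDEFINED = None  # Hypothetical stand-in for the undefined location.
  LanePCs = List[Optional[int]]


  def select_by_mask(base: LanePCs, update: LanePCs, mask: int) -> LanePCs:
      """Per lane, take ``update`` where the mask bit is set, else ``base``."""
      return [u if (mask >> lane) & 1 else b
              for lane, (b, u) in enumerate(zip(base, update))]


  def divergent_lane_pc(enclosing: LanePCs, region_pc: int,
                        saved_exec: int) -> LanePCs:
      """Lanes active on entry to the region are placed at ``region_pc``;
      all other lanes inherit the enclosing region's value."""
      return select_by_mask(enclosing, [region_pc] * len(enclosing), saved_exec)


  def lane_pc(divergent: LanePCs, current_pc: int, exec_mask: int) -> LanePCs:
      """Overlay the current program location on the lanes that are active."""
      return select_by_mask(divergent, [current_pc] * len(divergent), exec_mask)

For example, with four lanes of which lanes 0 to 2 were active on entry to a
region starting at ``0x10``, ``divergent_lane_pc([None] * 4, 0x10, 0b0111)``
gives ``[0x10, 0x10, 0x10, None]``. If only lanes 0 and 1 are then active at
``0x20``, ``lane_pc`` of that result yields ``[0x20, 0x20, 0x10, None]``.
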
2212The following provides an example using pseudo LLVM MIR.
2213
2214.. code::
2215  :number-lines:
2216
2217  $lex_start:
2218    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2219      DW_AT_name = "__uint64";
2220      DW_AT_byte_size = 8;
2221      DW_AT_encoding = DW_ATE_unsigned;
2222    ];
2223    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2224      DW_AT_name = "__active_lane_pc";
      DW_AT_location = DIExpression[
2226        DW_OP_regx PC;
2227        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
2229        DW_OP_LLVM_select_bit_piece 64, 64;
2230      ];
2231    ];
2232    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2233      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = DIExpression[
2235        DW_OP_LLVM_undefined;
2236        DW_OP_LLVM_extend 64, 64;
2237      ];
2238    ];
2239    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2240      DW_OP_call_ref %__divergent_lane_pc;
2241      DW_OP_call_ref %__active_lane_pc;
2242    ];
2243    a;
2244    %1 = EXEC;
2245    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2246    %2 = c1;
2247  $lex_1_start:
2248    EXEC = %1 & %2;
2249  $lex_1_then:
2250      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2251        DW_AT_name = "__divergent_lane_pc_1_then";
2252        DW_AT_location = DIExpression[
2253          DW_OP_call_ref %__divergent_lane_pc;
2254          DW_OP_addrx &lex_1_start;
2255          DW_OP_stack_value;
2256          DW_OP_LLVM_extend 64, 64;
2257          DW_OP_call_ref %__lex_1_save_exec;
2258          DW_OP_deref_type 64, %__uint_64;
2259          DW_OP_LLVM_select_bit_piece 64, 64;
2260        ];
2261      ];
2262      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2263        DW_OP_call_ref %__divergent_lane_pc_1_then;
2264        DW_OP_call_ref %__active_lane_pc;
2265      ];
2266      b;
2267      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2269      %4 = c2;
2270  $lex_1_1_start:
2271      EXEC = %3 & %4;
2272  $lex_1_1_then:
2273        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2274          DW_AT_name = "__divergent_lane_pc_1_1_then";
2275          DW_AT_location = DIExpression[
2276            DW_OP_call_ref %__divergent_lane_pc_1_then;
2277            DW_OP_addrx &lex_1_1_start;
2278            DW_OP_stack_value;
2279            DW_OP_LLVM_extend 64, 64;
2280            DW_OP_call_ref %__lex_1_1_save_exec;
2281            DW_OP_deref_type 64, %__uint_64;
2282            DW_OP_LLVM_select_bit_piece 64, 64;
2283          ];
2284        ];
2285        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2286          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2287          DW_OP_call_ref %__active_lane_pc;
2288        ];
2289        c;
2290      EXEC = ~EXEC & %3;
2291  $lex_1_1_else:
2292        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2293          DW_AT_name = "__divergent_lane_pc_1_1_else";
2294          DW_AT_location = DIExpression[
2295            DW_OP_call_ref %__divergent_lane_pc_1_then;
2296            DW_OP_addrx &lex_1_1_end;
2297            DW_OP_stack_value;
2298            DW_OP_LLVM_extend 64, 64;
2299            DW_OP_call_ref %__lex_1_1_save_exec;
2300            DW_OP_deref_type 64, %__uint_64;
2301            DW_OP_LLVM_select_bit_piece 64, 64;
2302          ];
2303        ];
2304        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2305          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2306          DW_OP_call_ref %__active_lane_pc;
2307        ];
2308        d;
2309      EXEC = %3;
2310  $lex_1_1_end:
2311      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
2313        DW_OP_call_ref %__active_lane_pc;
2314      ];
2315      e;
2316    EXEC = ~EXEC & %1;
2317  $lex_1_else:
2318      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2319        DW_AT_name = "__divergent_lane_pc_1_else";
2320        DW_AT_location = DIExpression[
2321          DW_OP_call_ref %__divergent_lane_pc;
2322          DW_OP_addrx &lex_1_end;
2323          DW_OP_stack_value;
2324          DW_OP_LLVM_extend 64, 64;
2325          DW_OP_call_ref %__lex_1_save_exec;
2326          DW_OP_deref_type 64, %__uint_64;
2327          DW_OP_LLVM_select_bit_piece 64, 64;
2328        ];
2329      ];
2330      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2331        DW_OP_call_ref %__divergent_lane_pc_1_else;
2332        DW_OP_call_ref %__active_lane_pc;
2333      ];
2334      f;
2335    EXEC = %1;
2336  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2338      DW_OP_call_ref %__divergent_lane_pc;
2339      DW_OP_call_ref %__active_lane_pc;
2340    ];
2341    g;
2342  $lex_end:
2343
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.
2346
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
2352
2353The DWARF procedures for each region use the values of the saved execution mask
2354artificial variables to only update the lanes that are active on entry to the
2355region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
2358
2359Other structured control flow regions can be handled similarly. For example,
2360loops would set the divergent program location for the region at the end of the
2361loop. Any lanes active will be in the loop, and any lanes not active must have
2362exited the loop.
2363
2364An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2365``IF/THEN/ELSE`` regions.
2366
2367The DWARF procedures can use the active lane artificial variable described in
2368:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2369``EXEC`` mask in order to support whole or quad wavefront mode.
2370
2371.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2372
2373``DW_AT_LLVM_active_lane``
2374~~~~~~~~~~~~~~~~~~~~~~~~~~
2375
2376The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2377entry is used to specify the lanes that are conceptually active for a SIMT
2378thread.
2379
2380The execution mask may be modified to implement whole or quad wavefront mode
2381operations. For example, all lanes may need to temporarily be made active to
2382execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2383update it to enable the necessary lanes, perform the operations, and then
2384restore the ``EXEC`` mask from the saved value. While executing the whole
2385wavefront region, the conceptual execution mask is the saved value, not the
2386``EXEC`` value.
2387
2388This is handled by defining an artificial variable for the active lane mask. The
2389active lane mask artificial variable would be the actual ``EXEC`` mask for
2390normal regions, and the saved execution mask for regions where the mask is
2391temporarily updated. The location list expression created for this artificial
2392variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2393attribute.
2394
2395``DW_AT_LLVM_augmentation``
2396~~~~~~~~~~~~~~~~~~~~~~~~~~~
2397
2398For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2399debugger information entry has the following value for the augmentation string:
2400
2401::
2402
2403  [amdgpu:v0.0]
2404
2405The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2406extensions used in the DWARF of the compilation unit. The version number
2407conforms to [SEMVER]_.
2408
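A consumer might extract the version from such a string as sketched below. The
Python helper ``parse_augmentation`` is hypothetical, and the assumption that a
string may hold several bracketed ``[vendor:vX.Y]`` entries is an illustration
rather than something this section specifies.

.. code:: python

  import re
  from typing import List, Tuple

  # One bracketed entry: vendor name, then major and minor version numbers.
  _ENTRY = re.compile(r"\[([^:\[\]]+):v(\d+)\.(\d+)\]")


  def parse_augmentation(text: str) -> List[Tuple[str, int, int]]:
      """Return (vendor, major, minor) for every entry found in ``text``."""
      return [(m.group(1), int(m.group(2)), int(m.group(3)))
              for m in _ENTRY.finditer(text)]

Applied to ``[amdgpu:v0.0]`` this returns a single entry with vendor
``amdgpu``, major version 0 and minor version 0.
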
2409Call Frame Information
2410----------------------
2411
2412DWARF Call Frame Information (CFI) describes how a consumer can virtually
2413*unwind* call frames in a running process or core dump. See DWARF Version 5
2414section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2415
2416For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2417
24181.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2419
2420    ::
2421
2422      [amd:v0.0]
2423
    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or by the FDEs that use it. The version number
    conforms to [SEMVER]_.
2427
24282.  ``address_size`` for the ``Global`` address space is defined in
2429    :ref:`amdgpu-dwarf-address-space-identifier`.
2430
24313.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2432
24334.  ``code_alignment_factor`` is 4 bytes.
2434
2435    .. TODO::
2436
2437       Add to :ref:`amdgpu-processor-table` table.
2438
24395.  ``data_alignment_factor`` is 4 bytes.
2440
2441    .. TODO::
2442
2443       Add to :ref:`amdgpu-processor-table` table.
2444
24456.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2446    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2447
7.  ``initial_instructions``: Since a subprogram X with fewer registers can be
    called from subprogram Y that has more allocated, X will not change any of
    the extra registers as it cannot access them. Therefore, the default rule
    for all columns is ``same value``.
2452
2453For AMDGPU the register number follows the numbering defined in
2454:ref:`amdgpu-dwarf-register-identifier`.
2455
2456For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2457the return address to get the address of a byte within the call site
2458instructions. See DWARF Version 5 section 6.4.4.
2459
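The reason for subtracting 1 can be seen with a sorted table of instruction
start addresses. In the Python sketch below, ``row_for_return_address`` and the
``row_starts`` list are hypothetical; a real consumer would obtain the
addresses from the line table or the FDEs.

.. code:: python

  import bisect
  from typing import List, Optional


  def row_for_return_address(row_starts: List[int],
                             return_address: int) -> Optional[int]:
      """Return the index of the entry that covers the call site.

      ``row_starts`` must be sorted in ascending order. The lookup uses
      ``return_address - 1`` so that it lands inside the call instruction
      rather than on the instruction that follows it.
      """
      lookup = return_address - 1
      index = bisect.bisect_right(row_starts, lookup) - 1
      return index if index >= 0 else None

If a call instruction occupies addresses ``0x100`` to ``0x107``, its return
address is ``0x108``. Looking up ``0x108`` directly would select the following
instruction, whereas ``0x107`` selects the call itself.
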
2460Accelerated Access
2461------------------
2462
2463See DWARF Version 5 section 6.1.
2464
2465Lookup By Name Section Header
2466~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2467
2468See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2469
For AMDGPU the lookup by name section header table has the following values:
2471
2472``augmentation_string_size`` (uword)
2473
2474  Set to the length of the ``augmentation_string`` value which is always a
2475  multiple of 4.
2476
2477``augmentation_string`` (sequence of UTF-8 characters)
2478
2479  Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2480
2481  ::
2482
2483    [amdgpu:v0.0]
2484
2485  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2486  extensions used in the DWARF of this index. The version number conforms to
2487  [SEMVER]_.
2488
2489  .. note::
2490
    This differs from the DWARF Version 5 definition, which requires the first
    4 characters to be the vendor ID. However, it is consistent with the other
    augmentation strings and allows multiple vendor contributions. Backwards
    compatibility may nevertheless be more desirable.
2495
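The padding rule can be sketched as follows. ``encode_augmentation`` is a
hypothetical Python helper, not an LLVM API; it returns the value for
``augmentation_string_size`` together with the null padded bytes.

.. code:: python

  from typing import Tuple


  def encode_augmentation(text: str = "[amdgpu:v0.0]") -> Tuple[int, bytes]:
      """Encode ``text`` as UTF-8 and null pad it to a multiple of 4 bytes."""
      data = text.encode("utf-8")
      padding = (-len(data)) % 4  # Bytes needed to reach the next multiple of 4.
      data += b"\x00" * padding
      return len(data), data

The string ``[amdgpu:v0.0]`` is 13 bytes long, so three null bytes are appended
and the reported size is 16.
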
2496Lookup By Address Section Header
2497~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2498
2499See DWARF Version 5 section 6.1.2.
2500
For AMDGPU the lookup by address section header table has the following values:
2502
2503``address_size`` (ubyte)
2504
2505  Match the address size for the ``Global`` address space defined in
2506  :ref:`amdgpu-dwarf-address-space-identifier`.
2507
2508``segment_selector_size`` (ubyte)
2509
2510  AMDGPU does not use a segment selector so this is 0. The entries in the
2511  ``.debug_aranges`` do not have a segment selector.
2512
2513Line Number Information
2514-----------------------
2515
2516See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2517
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2519The instruction set must be obtained from the ELF file header ``e_flags`` field
2520in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2521<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
2522
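Extracting the processor from ``e_flags`` amounts to masking the field. In the
Python sketch below the mask value ``0x0ff`` for ``EF_AMDGPU_MACH`` is an
assumption that should be checked against the ELF header tables in this guide,
and ``amdgpu_mach`` is a hypothetical helper.

.. code:: python

  # Assumed mask for the EF_AMDGPU_MACH field; verify against the ELF header
  # tables before relying on it.
  EF_AMDGPU_MACH = 0x0FF


  def amdgpu_mach(e_flags: int) -> int:
      """Return the processor selector stored in the ELF header ``e_flags``."""
      return e_flags & EF_AMDGPU_MACH

The returned value can then be compared against the ``EF_AMDGPU_MACH`` values
listed for the ELF header to identify the instruction set.
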
2523.. TODO::
2524
2525  Should the ``isa`` state machine register be used to indicate if the code is
2526  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2527
2528For AMDGPU the line number program header fields have the following values (see
2529DWARF Version 5 section 6.2.4):
2530
2531``address_size`` (ubyte)
2532  Matches the address size for the ``Global`` address space defined in
2533  :ref:`amdgpu-dwarf-address-space-identifier`.
2534
2535``segment_selector_size`` (ubyte)
2536  AMDGPU does not use a segment selector so this is 0.
2537
2538``minimum_instruction_length`` (ubyte)
2539  For GFX9-GFX11 this is 4.
2540
2541``maximum_operations_per_instruction`` (ubyte)
2542  For GFX9-GFX11 this is 1.
2543
2544Source text for online-compiled programs (for example, those compiled by the
2545OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2546See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2547Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2548<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2549
2550The Clang option used to control source embedding in AMDGPU is defined in
2551:ref:`amdgpu-clang-debug-options-table`.
2552
2553  .. table:: AMDGPU Clang Debug Options
2554     :name: amdgpu-clang-debug-options-table
2555
2556     ==================== ==================================================
2557     Debug Flag           Description
2558     ==================== ==================================================
2559     -g[no-]embed-source  Enable/disable embedding source text in DWARF
2560                          debug sections. Useful for environments where
2561                          source cannot be written to disk, such as
2562                          when performing online compilation.
2563     ==================== ==================================================
2564
2565For example:
2566
2567``-gembed-source``
2568  Enable the embedded source.
2569
2570``-gno-embed-source``
2571  Disable the embedded source.
2572
257332-Bit and 64-Bit DWARF Formats
2574-------------------------------
2575
2576See DWARF Version 5 section 7.4 and
2577:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2578
2579For AMDGPU:
2580
2581* For the ``amdgcn`` target architecture only the 64-bit process address space
2582  is supported.
2583
2584* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2585  the 32-bit DWARF format.
2586
2587Unit Headers
2588------------
2589
2590For AMDGPU the following values apply for each of the unit headers described in
2591DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2592
2593``address_size`` (ubyte)
2594  Matches the address size for the ``Global`` address space defined in
2595  :ref:`amdgpu-dwarf-address-space-identifier`.
2596
2597.. _amdgpu-code-conventions:
2598
2599Code Conventions
2600================
2601
2602This section provides code conventions used for each supported target triple OS
2603(see :ref:`amdgpu-target-triples`).
2604
2605AMDHSA
2606------
2607
2608This section provides code conventions used when the target triple OS is
2609``amdhsa`` (see :ref:`amdgpu-target-triples`).
2610
2611.. _amdgpu-amdhsa-code-object-metadata:
2612
2613Code Object Metadata
2614~~~~~~~~~~~~~~~~~~~~
2615
2616The code object metadata specifies extensible metadata associated with the code
2617objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
2619:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2620:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2621:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2622:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2623
2624Code object metadata is specified in a note record (see
2625:ref:`amdgpu-note-records`) and is required when the target triple OS is
2626``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries, for
example, the segment sizes needed in a dispatch packet. In addition, a
2629high-level language runtime may require other information to be included. For
2630example, the AMD OpenCL runtime records kernel argument information.
2631
2632.. _amdgpu-amdhsa-code-object-metadata-v2:
2633
2634Code Object V2 Metadata
2635+++++++++++++++++++++++
2636
2637.. warning::
2638  Code object V2 is not the default code object version emitted by this version
2639  of LLVM.
2640
2641Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2642(see :ref:`amdgpu-note-records-v2`).
2643
2644The metadata is specified as a YAML formatted string (see [YAML]_ and
2645:doc:`YamlIO`).
2646
2647.. TODO::
2648
2649  Is the string null terminated? It probably should not if YAML allows it to
2650  contain null characters, otherwise it should be.
2651
2652The metadata is represented as a single YAML document comprised of the mapping
2653defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2654referenced tables.
2655
2656For boolean values, the string values of ``false`` and ``true`` are used for
2657false and true respectively.
2658
2659Additional information can be added to the mappings. To avoid conflicts, any
2660non-AMD key names should be prefixed by "*vendor-name*.".
2661
2662  .. table:: AMDHSA Code Object V2 Metadata Map
2663     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2664
2665     ========== ============== ========= =======================================
2666     String Key Value Type     Required? Description
2667     ========== ============== ========= =======================================
2668     "Version"  sequence of    Required  - The first integer is the major
2669                2 integers                 version. Currently 1.
2670                                         - The second integer is the minor
2671                                           version. Currently 0.
2672     "Printf"   sequence of              Each string is encoded information
2673                strings                  about a printf function call. The
2674                                         encoded information is organized as
2675                                         fields separated by colon (':'):
2676
2677                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2678
2679                                         where:
2680
2681                                         ``ID``
2682                                           A 32-bit integer as a unique id for
2683                                           each printf function call
2684
2685                                         ``N``
2686                                           A 32-bit integer equal to the number
2687                                           of arguments of printf function call
2688                                           minus 1
2689
2690                                         ``S[i]`` (where i = 0, 1, ... , N-1)
2691                                           32-bit integers for the size in bytes
2692                                           of the i-th FormatString argument of
2693                                           the printf function call
2694
2695                                         FormatString
2696                                           The format string passed to the
2697                                           printf function call.
2698     "Kernels"  sequence of    Required  Sequence of the mappings for each
2699                mapping                  kernel in the code object. See
2700                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2701                                         for the definition of the mapping.
2702     ========== ============== ========= =======================================
2703
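The ``Printf`` string layout can be decoded as sketched below.
``parse_printf_entry`` and ``PrintfEntry`` are hypothetical Python names, not
part of any runtime. The sketch assumes the ``ID``, ``N`` and ``S[i]`` fields
contain no colon, so that any further colons belong to the format string.

.. code:: python

  from typing import List, NamedTuple


  class PrintfEntry(NamedTuple):
      call_id: int
      arg_sizes: List[int]
      format_string: str


  def parse_printf_entry(text: str) -> PrintfEntry:
      """Decode ``ID:N:S[0]:...:S[N-1]:FormatString``."""
      ident, count, rest = text.split(":", 2)
      n = int(count)
      # Split off exactly N size fields; the remainder is the format string.
      fields = rest.split(":", n)
      if len(fields) != n + 1:
          raise ValueError("expected %d argument sizes" % n)
      sizes = [int(size) for size in fields[:n]]
      return PrintfEntry(int(ident), sizes, fields[n])

For the entry ``7:2:4:8:x=%d:y=%ld`` this yields ID 7, argument sizes 4 and 8,
and the format string ``x=%d:y=%ld``.
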
2704..
2705
2706  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2707     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2708
2709     ================= ============== ========= ================================
2710     String Key        Value Type     Required? Description
2711     ================= ============== ========= ================================
2712     "Name"            string         Required  Source name of the kernel.
2713     "SymbolName"      string         Required  Name of the kernel
2714                                                descriptor ELF symbol.
2715     "Language"        string                   Source language of the kernel.
2716                                                Values include:
2717
2718                                                - "OpenCL C"
2719                                                - "OpenCL C++"
2720                                                - "HCC"
2721                                                - "OpenMP"
2722
2723     "LanguageVersion" sequence of              - The first integer is the major
2724                       2 integers                 version.
2725                                                - The second integer is the
2726                                                  minor version.
2727     "Attrs"           mapping                  Mapping of kernel attributes.
2728                                                See
2729                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2730                                                for the mapping definition.
2731     "Args"            sequence of              Sequence of mappings of the
2732                       mapping                  kernel arguments. See
2733                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2734                                                for the definition of the mapping.
2735     "CodeProps"       mapping                  Mapping of properties related to
2736                                                the kernel code. See
2737                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2738                                                for the mapping definition.
2739     ================= ============== ========= ================================
2740
2741..
2742
2743  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2744     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2745
2746     =================== ============== ========= ==============================
2747     String Key          Value Type     Required? Description
2748     =================== ============== ========= ==============================
2749     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2750                         3 integers               must be >=1 and the dispatch
2751                                                  work-group size X, Y, Z must
2752                                                  correspond to the specified
2753                                                  values. Defaults to 0, 0, 0.
2754
2755                                                  Corresponds to the OpenCL
2756                                                  ``reqd_work_group_size``
2757                                                  attribute.
2758     "WorkGroupSizeHint" sequence of              The dispatch work-group size
2759                         3 integers               X, Y, Z is likely to be the
2760                                                  specified values.
2761
2762                                                  Corresponds to the OpenCL
2763                                                  ``work_group_size_hint``
2764                                                  attribute.
2765     "VecTypeHint"       string                   The name of a scalar or vector
2766                                                  type.
2767
2768                                                  Corresponds to the OpenCL
2769                                                  ``vec_type_hint`` attribute.
2770
2771     "RuntimeHandle"     string                   The external symbol name
2772                                                  associated with a kernel.
2773                                                  OpenCL runtime allocates a
2774                                                  global buffer for the symbol
2775                                                  and saves the kernel's address
2776                                                  to it, which is used for
2777                                                  device side enqueueing. Only
2778                                                  available for device side
2779                                                  enqueued kernels.
2780     =================== ============== ========= ==============================
2781
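The ``ReqdWorkGroupSize`` rule can be expressed as a small check. The Python
helpers below are hypothetical and only restate the table entry; they are not a
runtime API.

.. code:: python

  from typing import Sequence


  def is_valid_reqd_work_group_size(size: Sequence[int]) -> bool:
      """Accept 0, 0, 0 (the default) or three values that are all >= 1."""
      if len(size) != 3:
          return False
      if all(value == 0 for value in size):
          return True
      return all(value >= 1 for value in size)


  def dispatch_matches(reqd: Sequence[int], dispatch: Sequence[int]) -> bool:
      """Check a dispatch work-group size X, Y, Z against the required size."""
      if all(value == 0 for value in reqd):
          return True  # No required size was specified.
      return list(reqd) == list(dispatch)

Under these rules ``[64, 1, 1]`` is valid and ``[64, 0, 1]`` is not, and a
dispatch of ``[32, 1, 1]`` does not match a required size of ``[64, 1, 1]``.
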
2782..
2783
2784  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2785     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2786
2787     ================= ============== ========= ================================
2788     String Key        Value Type     Required? Description
2789     ================= ============== ========= ================================
2790     "Name"            string                   Kernel argument name.
2791     "TypeName"        string                   Kernel argument type name.
2792     "Size"            integer        Required  Kernel argument size in bytes.
2793     "Align"           integer        Required  Kernel argument alignment in
2794                                                bytes. Must be a power of two.
2795     "ValueKind"       string         Required  Kernel argument kind that
2796                                                specifies how to set up the
2797                                                corresponding argument.
2798                                                Values include:
2799
2800                                                "ByValue"
2801                                                  The argument is copied
2802                                                  directly into the kernarg.
2803
2804                                                "GlobalBuffer"
2805                                                  A global address space pointer
2806                                                  to the buffer data is passed
2807                                                  in the kernarg.
2808
2809                                                "DynamicSharedPointer"
2810                                                  A group address space pointer
2811                                                  to dynamically allocated LDS
2812                                                  is passed in the kernarg.
2813
2814                                                "Sampler"
2815                                                  A global address space
2816                                                  pointer to a S# is passed in
2817                                                  the kernarg.
2818
2819                                                "Image"
2820                                                  A global address space
2821                                                  pointer to a T# is passed in
2822                                                  the kernarg.
2823
2824                                                "Pipe"
2825                                                  A global address space pointer
2826                                                  to an OpenCL pipe is passed in
2827                                                  the kernarg.
2828
2829                                                "Queue"
2830                                                  A global address space pointer
2831                                                  to an OpenCL device enqueue
2832                                                  queue is passed in the
2833                                                  kernarg.
2834
2835                                                "HiddenGlobalOffsetX"
2836                                                  The OpenCL grid dispatch
2837                                                  global offset for the X
2838                                                  dimension is passed in the
2839                                                  kernarg.
2840
2841                                                "HiddenGlobalOffsetY"
2842                                                  The OpenCL grid dispatch
2843                                                  global offset for the Y
2844                                                  dimension is passed in the
2845                                                  kernarg.
2846
2847                                                "HiddenGlobalOffsetZ"
2848                                                  The OpenCL grid dispatch
2849                                                  global offset for the Z
2850                                                  dimension is passed in the
2851                                                  kernarg.
2852
2853                                                "HiddenNone"
2854                                                  An argument that is not used
2855                                                  by the kernel. Space needs to
2856                                                  be left for it, but it does
2857                                                  not need to be set up.
2858
2859                                                "HiddenPrintfBuffer"
2860                                                  A global address space pointer
2861                                                  to the runtime printf buffer
2862                                                  is passed in the kernarg.
2863                                                  Mutually exclusive with
2864                                                  "HiddenHostcallBuffer".
2865
2866                                                "HiddenHostcallBuffer"
2867                                                  A global address space pointer
2868                                                  to the runtime hostcall buffer
2869                                                  is passed in the kernarg.
2870                                                  Mutually exclusive with
2871                                                  "HiddenPrintfBuffer".
2872
2873                                                "HiddenDefaultQueue"
2874                                                  A global address space pointer
2875                                                  to the OpenCL device enqueue
2876                                                  queue that should be used by
2877                                                  the kernel by default is
2878                                                  passed in the kernarg.
2879
2880                                                "HiddenCompletionAction"
2881                                                  A global address space pointer
2882                                                  to help link enqueued kernels into
2883                                                  the ancestor tree for determining
2884                                                  when the parent kernel has finished.
2885
2886                                                "HiddenMultiGridSyncArg"
2887                                                  A global address space pointer for
2888                                                  multi-grid synchronization is
2889                                                  passed in the kernarg.
2890
2891     "ValueType"       string                   Unused and deprecated. This should no longer
2892                                                be emitted, but is accepted for compatibility.
2893
2894
2895     "PointeeAlign"    integer                  Alignment in bytes of pointee
2896                                                type for pointer type kernel
2897                                                argument. Must be a power
2898                                                of 2. Only present if
2899                                                "ValueKind" is
2900                                                "DynamicSharedPointer".
2901     "AddrSpaceQual"   string                   Kernel argument address space
2902                                                qualifier. Only present if
2903                                                "ValueKind" is "GlobalBuffer" or
2904                                                "DynamicSharedPointer". Values
2905                                                are:
2906
2907                                                - "Private"
2908                                                - "Global"
2909                                                - "Constant"
2910                                                - "Local"
2911                                                - "Generic"
2912                                                - "Region"
2913
2914                                                .. TODO::
2915
2916                                                   Is GlobalBuffer only Global
2917                                                   or Constant? Is
2918                                                   DynamicSharedPointer always
2919                                                   Local? Can HCC allow Generic?
2920                                                   How can Private or Region
2921                                                   ever happen?
2922
2923     "AccQual"         string                   Kernel argument access
2924                                                qualifier. Only present if
2925                                                "ValueKind" is "Image" or
2926                                                "Pipe". Values
2927                                                are:
2928
2929                                                - "ReadOnly"
2930                                                - "WriteOnly"
2931                                                - "ReadWrite"
2932
2933                                                .. TODO::
2934
2935                                                   Does this apply to
2936                                                   GlobalBuffer?
2937
2938     "ActualAccQual"   string                   The actual memory accesses
2939                                                performed by the kernel on the
2940                                                kernel argument. Only present if
2941                                                "ValueKind" is "GlobalBuffer",
2942                                                "Image", or "Pipe". This may be
2943                                                more restrictive than indicated
2944                                                by "AccQual" to reflect what the
2945                                                kernel actually does. If not
2946                                                present then the runtime must
2947                                                assume what is implied by
2948                                                "AccQual" and "IsConst". Values
2949                                                are:
2950
2951                                                - "ReadOnly"
2952                                                - "WriteOnly"
2953                                                - "ReadWrite"
2954
2955     "IsConst"         boolean                  Indicates if the kernel argument
2956                                                is const qualified. Only present
2957                                                if "ValueKind" is
2958                                                "GlobalBuffer".
2959
2960     "IsRestrict"      boolean                  Indicates if the kernel argument
2961                                                is restrict qualified. Only
2962                                                present if "ValueKind" is
2963                                                "GlobalBuffer".
2964
2965     "IsVolatile"      boolean                  Indicates if the kernel argument
2966                                                is volatile qualified. Only
2967                                                present if "ValueKind" is
2968                                                "GlobalBuffer".
2969
2970     "IsPipe"          boolean                  Indicates if the kernel argument
2971                                                is pipe qualified. Only present
2972                                                if "ValueKind" is "Pipe".
2973
2974                                                .. TODO::
2975
2976                                                   Can GlobalBuffer be pipe
2977                                                   qualified?
2978
2979     ================= ============== ========= ================================
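
The "Only present if" rules in the table above can be checked mechanically.
The following Python sketch is one possible way to do so, assuming the kernel
arguments have already been decoded into plain ``dict`` objects keyed by the
strings in the table. The names ``ALLOWED_KINDS`` and ``check_v2_args`` are
hypothetical and are not part of LLVM or of any runtime.

.. code-block:: python

   # Optional keys and the "ValueKind" values for which the table above
   # says they may be present.
   ALLOWED_KINDS = {
       "PointeeAlign": {"DynamicSharedPointer"},
       "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
       "AccQual": {"Image", "Pipe"},
       "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
       "IsConst": {"GlobalBuffer"},
       "IsRestrict": {"GlobalBuffer"},
       "IsVolatile": {"GlobalBuffer"},
       "IsPipe": {"Pipe"},
   }


   def check_v2_args(args):
       """Return a list of problems found in a list of argument maps."""
       problems = []
       for index, arg in enumerate(args):
           kind = arg.get("ValueKind")
           for key, kinds in ALLOWED_KINDS.items():
               if key in arg and kind not in kinds:
                   problems.append(f"arg {index}: {key} not allowed for {kind}")
           align = arg.get("PointeeAlign")
           if align is not None and (align <= 0 or (align & (align - 1)) != 0):
               problems.append(f"arg {index}: PointeeAlign is not a power of 2")
       kinds_present = {arg.get("ValueKind") for arg in args}
       if {"HiddenPrintfBuffer", "HiddenHostcallBuffer"} <= kinds_present:
           problems.append(
               "HiddenPrintfBuffer and HiddenHostcallBuffer are mutually exclusive"
           )
       return problems

An empty result only means that none of the rules encoded in the sketch were
violated; it does not validate the remaining keys of the argument map.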
2980
2981..
2982
2983  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2984     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2985
2986     ============================ ============== ========= =====================
2987     String Key                   Value Type     Required? Description
2988     ============================ ============== ========= =====================
2989     "KernargSegmentSize"         integer        Required  The size in bytes of
2990                                                           the kernarg segment
2991                                                           that holds the values
2992                                                           of the arguments to
2993                                                           the kernel.
2994     "GroupSegmentFixedSize"      integer        Required  The amount of group
2995                                                           segment memory
2996                                                           required by a
2997                                                           work-group in
2998                                                           bytes. This does not
2999                                                           include any
3000                                                           dynamically allocated
3001                                                           group segment memory
3002                                                           that may be added
3003                                                           when the kernel is
3004                                                           dispatched.
3005     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
3006                                                           private address space
3007                                                           memory required for a
3008                                                           work-item in
3009                                                           bytes. If the kernel
3010                                                           uses a dynamic call
3011                                                           stack then additional
3012                                                           space must be added
3013                                                           to this value for the
3014                                                           call stack.
3015     "KernargSegmentAlign"        integer        Required  The maximum byte
3016                                                           alignment of
3017                                                           arguments in the
3018                                                           kernarg segment. Must
3019                                                           be a power of 2.
3020     "WavefrontSize"              integer        Required  Wavefront size. Must
3021                                                           be a power of 2.
3022     "NumSGPRs"                   integer        Required  Number of scalar
3023                                                           registers used by a
3024                                                           wavefront for
3025                                                           GFX6-GFX11. This
3026                                                           includes the special
3027                                                           SGPRs for VCC, Flat
3028                                                           Scratch (GFX7-GFX10)
3029                                                           and XNACK (for
3030                                                           GFX8-GFX10). It does
3031                                                           not include the 16
3032                                                           SGPR added if a trap
3033                                                           handler is
3034                                                           enabled. It is not
3035                                                           rounded up to the
3036                                                           allocation
3037                                                           granularity.
3038     "NumVGPRs"                   integer        Required  Number of vector
3039                                                           registers used by
3040                                                           each work-item for
3041                                                           GFX6-GFX11.
3042     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
3043                                                           work-group size
3044                                                           supported by the
3045                                                           kernel in work-items.
3046                                                           Must be >=1 and
3047                                                           consistent with
3048                                                           ReqdWorkGroupSize if
3049                                                           not 0, 0, 0.
3050     "NumSpilledSGPRs"            integer                  Number of stores from
3051                                                           a scalar register to
3052                                                           a register allocator
3053                                                           created spill
3054                                                           location.
3055     "NumSpilledVGPRs"            integer                  Number of stores from
3056                                                           a vector register to
3057                                                           a register allocator
3058                                                           created spill
3059                                                           location.
3060     ============================ ============== ========= =====================
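
The constraints stated in the table above (required keys, power-of-2 values,
and a minimum flat work-group size) can be expressed as a small check. This is
a minimal Python sketch, assuming the code properties map has already been
decoded into a ``dict``. The names ``REQUIRED_KEYS``, ``is_power_of_two`` and
``check_code_props`` are hypothetical.

.. code-block:: python

   # Keys marked "Required" in the kernel code properties table.
   REQUIRED_KEYS = (
       "KernargSegmentSize",
       "GroupSegmentFixedSize",
       "PrivateSegmentFixedSize",
       "KernargSegmentAlign",
       "WavefrontSize",
       "NumSGPRs",
       "NumVGPRs",
       "MaxFlatWorkGroupSize",
   )


   def is_power_of_two(value):
       """True for 1, 2, 4, 8, ... and False for everything else."""
       return value > 0 and (value & (value - 1)) == 0


   def check_code_props(props):
       """Return a list of problems found in a code properties map."""
       problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in props]
       for key in ("KernargSegmentAlign", "WavefrontSize"):
           if key in props and not is_power_of_two(props[key]):
               problems.append(f"{key} must be a power of 2")
       if "MaxFlatWorkGroupSize" in props and props["MaxFlatWorkGroupSize"] < 1:
           problems.append("MaxFlatWorkGroupSize must be >= 1")
       return problems

The sketch does not check consistency with ``ReqdWorkGroupSize``, because that
key lives in a different map.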
3061
3062.. _amdgpu-amdhsa-code-object-metadata-v3:
3063
3064Code Object V3 Metadata
3065+++++++++++++++++++++++
3066
3067.. warning::
3068  Code object V3 is not the default code object version emitted by this version
3069  of LLVM.
3070
3071Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3072record (see :ref:`amdgpu-note-records-v3-onwards`).
3073
3074The metadata is represented as Message Pack formatted binary data (see
3075[MsgPack]_). The top level is a Message Pack map that includes the
3076keys defined in table
3077:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3078tables.
3079
3080Additional information can be added to the maps. To avoid conflicts,
3081any key names should be prefixed by "*vendor-name*." where
3082*vendor-name* can be the name of the vendor and the specific vendor
3083tool that generates the information. The prefix is abbreviated to
3084simply "." when it appears within a map that has been added by the
3085same *vendor-name*.
3086
3087  .. table:: AMDHSA Code Object V3 Metadata Map
3088     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3089
3090     ================= ============== ========= =======================================
3091     String Key        Value Type     Required? Description
3092     ================= ============== ========= =======================================
3093     "amdhsa.version"  sequence of    Required  - The first integer is the major
3094                       2 integers                 version. Currently 1.
3095                                                - The second integer is the minor
3096                                                  version. Currently 0.
3097     "amdhsa.printf"   sequence of              Each string is encoded information
3098                       strings                  about a printf function call. The
3099                                                encoded information is organized as
3100                                                fields separated by colon (':'):
3101
3102                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3103
3104                                                where:
3105
3106                                                ``ID``
3107                                                  A 32-bit integer as a unique id for
3108                                                  each printf function call
3109
3110                                                ``N``
3111                                                  A 32-bit integer equal to the number
3112                                                  of arguments of the printf function
3113                                                  call minus 1.
3114
3115                                                ``S[i]`` (where i = 0, 1, ... , N-1)
3116                                                  32-bit integers for the size in bytes
3117                                                  of the i-th FormatString argument of
3118                                                  the printf function call
3119
3120                                                ``FormatString``
3121                                                  The format string passed to the
3122                                                  printf function call.
3123     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
3124                       map                      kernel in the code object. See
3125                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3126                                                for the definition of the keys included
3127                                                in that map.
3128     ================= ============== ========= =======================================
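
The ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` layout described for
"amdhsa.printf" can be decoded as sketched below. This is an illustrative
Python helper, not code taken from LLVM or from a runtime; the name
``parse_printf_entry`` is hypothetical. It assumes the integer fields are
written in decimal, and it splits off exactly ``N`` size fields so that any
colons inside the format string are preserved.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split one "amdhsa.printf" string into (ID, sizes, format string)."""
       # ID and N come first; the remainder is split once N is known.
       id_text, n_text, rest = entry.split(":", 2)
       count = int(n_text)
       # Split at most N times so colons in the format string stay intact.
       fields = rest.split(":", count)
       if len(fields) != count + 1:
           raise ValueError(f"expected {count} sizes and a format string")
       sizes = [int(field) for field in fields[:count]]
       return int(id_text), sizes, fields[count]

For example, ``parse_printf_entry("1:2:4:8:x=%d y=%ld: done")`` yields the ID
``1``, the sizes ``[4, 8]`` and the format string ``"x=%d y=%ld: done"``.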
3129
3130..
3131
3132  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3133     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3134
3135     =================================== ============== ========= ================================
3136     String Key                          Value Type     Required? Description
3137     =================================== ============== ========= ================================
3138     ".name"                             string         Required  Source name of the kernel.
3139     ".symbol"                           string         Required  Name of the kernel
3140                                                                  descriptor ELF symbol.
3141     ".language"                         string                   Source language of the kernel.
3142                                                                  Values include:
3143
3144                                                                  - "OpenCL C"
3145                                                                  - "OpenCL C++"
3146                                                                  - "HCC"
3147                                                                  - "HIP"
3148                                                                  - "OpenMP"
3149                                                                  - "Assembler"
3150
3151     ".language_version"                 sequence of              - The first integer is the major
3152                                         2 integers                 version.
3153                                                                  - The second integer is the
3154                                                                    minor version.
3155     ".args"                             sequence of              Sequence of maps of the
3156                                         map                      kernel arguments. See
3157                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3158                                                                  for the definition of the keys
3159                                                                  included in that map.
3160     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3161                                         3 integers               must be >=1 and the dispatch
3162                                                                  work-group size X, Y, Z must
3163                                                                  correspond to the specified
3164                                                                  values. Defaults to 0, 0, 0.
3165
3166                                                                  Corresponds to the OpenCL
3167                                                                  ``reqd_work_group_size``
3168                                                                  attribute.
3169     ".workgroup_size_hint"              sequence of              The dispatch work-group size
3170                                         3 integers               X, Y, Z is likely to be the
3171                                                                  specified values.
3172
3173                                                                  Corresponds to the OpenCL
3174                                                                  ``work_group_size_hint``
3175                                                                  attribute.
3176     ".vec_type_hint"                    string                   The name of a scalar or vector
3177                                                                  type.
3178
3179                                                                  Corresponds to the OpenCL
3180                                                                  ``vec_type_hint`` attribute.
3181
3182     ".device_enqueue_symbol"            string                   The external symbol name
3183                                                                  associated with a kernel.
3184                                                                  The OpenCL runtime allocates
3185                                                                  a global buffer for the symbol
3186                                                                  and saves the kernel's address
3187                                                                  to it, which is used for
3188                                                                  device side enqueueing. Only
3189                                                                  available for device side
3190                                                                  enqueued kernels.
3191     ".kernarg_segment_size"             integer        Required  The size in bytes of
3192                                                                  the kernarg segment
3193                                                                  that holds the values
3194                                                                  of the arguments to
3195                                                                  the kernel.
3196     ".group_segment_fixed_size"         integer        Required  The amount of group
3197                                                                  segment memory
3198                                                                  required by a
3199                                                                  work-group in
3200                                                                  bytes. This does not
3201                                                                  include any
3202                                                                  dynamically allocated
3203                                                                  group segment memory
3204                                                                  that may be added
3205                                                                  when the kernel is
3206                                                                  dispatched.
3207     ".private_segment_fixed_size"       integer        Required  The amount of fixed
3208                                                                  private address space
3209                                                                  memory required for a
3210                                                                  work-item in
3211                                                                  bytes. If the kernel
3212                                                                  uses a dynamic call
3213                                                                  stack then additional
3214                                                                  space must be added
3215                                                                  to this value for the
3216                                                                  call stack.
3217     ".kernarg_segment_align"            integer        Required  The maximum byte
3218                                                                  alignment of
3219                                                                  arguments in the
3220                                                                  kernarg segment. Must
3221                                                                  be a power of 2.
3222     ".uses_dynamic_stack"               boolean                  Indicates if the generated
3223                                                                  machine code is using a
3224                                                                  dynamically sized stack.
3225     ".wavefront_size"                   integer        Required  Wavefront size. Must
3226                                                                  be a power of 2.
3227     ".sgpr_count"                       integer        Required  Number of scalar
3228                                                                  registers required by a
3229                                                                  wavefront for
3230                                                                  GFX6-GFX9. A register
3231                                                                  is required if it is
3232                                                                  used explicitly, or
3233                                                                  if a higher numbered
3234                                                                  register is used
3235                                                                  explicitly. This
3236                                                                  includes the special
3237                                                                  SGPRs for VCC, Flat
3238                                                                  Scratch (GFX7-GFX9)
3239                                                                  and XNACK (for
3240                                                                  GFX8-GFX9). It does
3241                                                                  not include the 16
3242                                                                  SGPR added if a trap
3243                                                                  handler is
3244                                                                  enabled. It is not
3245                                                                  rounded up to the
3246                                                                  allocation
3247                                                                  granularity.
3248     ".vgpr_count"                       integer        Required  Number of vector
3249                                                                  registers required by
3250                                                                  each work-item for
3251                                                                  GFX6-GFX9. A register
3252                                                                  is required if it is
3253                                                                  used explicitly, or
3254                                                                  if a higher numbered
3255                                                                  register is used
3256                                                                  explicitly.
3257     ".agpr_count"                       integer        Required  Number of accumulator
3258                                                                  registers required by
3259                                                                  each work-item for
3260                                                                  GFX90A, GFX908.
3261     ".max_flat_workgroup_size"          integer        Required  Maximum flat
3262                                                                  work-group size
3263                                                                  supported by the
3264                                                                  kernel in work-items.
3265                                                                  Must be >=1 and
3266                                                                  consistent with
3267                                                                  ".reqd_workgroup_size"
3268                                                                  if not 0, 0, 0.
3269     ".sgpr_spill_count"                 integer                  Number of stores from
3270                                                                  a scalar register to
3271                                                                  a register allocator
3272                                                                  created spill
3273                                                                  location.
3274     ".vgpr_spill_count"                 integer                  Number of stores from
3275                                                                  a vector register to
3276                                                                  a register allocator
3277                                                                  created spill
3278                                                                  location.
3279     ".kind"                             string                   The kind of the kernel
3280                                                                  with the following
3281                                                                  values:
3282
3283                                                                  "normal"
3284                                                                    Regular kernels.
3285
3286                                                                  "init"
3287                                                                    These kernels must be
3288                                                                    invoked after loading
3289                                                                    the containing code
3290                                                                    object and must
3291                                                                    complete before any
3292                                                                    normal and fini
3293                                                                    kernels in the same
3294                                                                    code object are
3295                                                                    invoked.
3296
3297                                                                  "fini"
3298                                                                    These kernels must be
3299                                                                    invoked before
3300                                                                    unloading the
3301                                                                    containing code object
3302                                                                    and after all init and
3303                                                                    normal kernels in the
3304                                                                    same code object have
3305                                                                    been invoked and
3306                                                                    completed.
3307
3308                                                                  If omitted, "normal" is
3309                                                                  assumed.
3310     =================================== ============== ========= ================================
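
As an illustration of the ``.kind`` ordering rule, the following minimal Python
sketch groups kernel metadata maps into the phases in which they may be invoked.
The helper name ``invocation_phases`` and the kernel names are hypothetical and
not part of any runtime API. Only the ``.kind`` values and the "normal" default
are taken from the table above.

.. code-block:: python

  def invocation_phases(kernels):
      """Group kernel metadata maps into init, normal and fini phases."""
      phases = {"init": [], "normal": [], "fini": []}
      for kernel in kernels:
          # ".kind" defaults to "normal" when it is omitted.
          kind = kernel.get(".kind", "normal")
          if kind not in phases:
              raise ValueError(f"unknown .kind: {kind!r}")
          phases[kind].append(kernel[".name"])
      return phases

  # Hypothetical kernels, listed in arbitrary order.
  kernels = [
      {".name": "teardown", ".kind": "fini"},
      {".name": "compute"},
      {".name": "setup", ".kind": "init"},
  ]
  phases = invocation_phases(kernels)

A loader following this sketch would run every kernel in ``phases["init"]`` to
completion before any other kernel, and those in ``phases["fini"]`` only after
all others have completed.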
3311
3312..
3313
3314  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3315     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3316
3317     ====================== ============== ========= ================================
3318     String Key             Value Type     Required? Description
3319     ====================== ============== ========= ================================
3320     ".name"                string                   Kernel argument name.
3321     ".type_name"           string                   Kernel argument type name.
3322     ".size"                integer        Required  Kernel argument size in bytes.
3323     ".offset"              integer        Required  Kernel argument offset in
3324                                                     bytes. The offset must be a
3325                                                     multiple of the alignment
3326                                                     required by the argument.
3327     ".value_kind"          string         Required  Kernel argument kind that
3328                                                     specifies how to set up the
3329                                                     corresponding argument.
3330                                                     Values include:
3331
3332                                                     "by_value"
3333                                                       The argument is copied
3334                                                       directly into the kernarg.
3335
3336                                                     "global_buffer"
3337                                                       A global address space pointer
3338                                                       to the buffer data is passed
3339                                                       in the kernarg.
3340
3341                                                     "dynamic_shared_pointer"
3342                                                       A group address space pointer
3343                                                       to dynamically allocated LDS
3344                                                       is passed in the kernarg.
3345
3346                                                     "sampler"
3347                                                       A global address space
3348                                                       pointer to a S# is passed in
3349                                                       the kernarg.
3350
3351                                                     "image"
3352                                                       A global address space
3353                                                       pointer to a T# is passed in
3354                                                       the kernarg.
3355
3356                                                     "pipe"
3357                                                       A global address space pointer
3358                                                       to an OpenCL pipe is passed in
3359                                                       the kernarg.
3360
3361                                                     "queue"
3362                                                       A global address space pointer
3363                                                       to an OpenCL device enqueue
3364                                                       queue is passed in the
3365                                                       kernarg.
3366
3367                                                     "hidden_global_offset_x"
3368                                                       The OpenCL grid dispatch
3369                                                       global offset for the X
3370                                                       dimension is passed in the
3371                                                       kernarg.
3372
3373                                                     "hidden_global_offset_y"
3374                                                       The OpenCL grid dispatch
3375                                                       global offset for the Y
3376                                                       dimension is passed in the
3377                                                       kernarg.
3378
3379                                                     "hidden_global_offset_z"
3380                                                       The OpenCL grid dispatch
3381                                                       global offset for the Z
3382                                                       dimension is passed in the
3383                                                       kernarg.
3384
3385                                                     "hidden_none"
3386                                                       An argument that is not used
3387                                                       by the kernel. Space needs to
3388                                                       be left for it, but it does
3389                                                       not need to be set up.
3390
3391                                                     "hidden_printf_buffer"
                                                       A global address space pointer
                                                       to the runtime printf buffer
                                                       is passed in the kernarg.
                                                       Mutually exclusive with
                                                       "hidden_hostcall_buffer"
                                                       before Code Object V5.
3398
3399                                                     "hidden_hostcall_buffer"
                                                       A global address space pointer
                                                       to the runtime hostcall buffer
                                                       is passed in the kernarg.
                                                       Mutually exclusive with
                                                       "hidden_printf_buffer"
                                                       before Code Object V5.
3406
3407                                                     "hidden_default_queue"
3408                                                       A global address space pointer
3409                                                       to the OpenCL device enqueue
3410                                                       queue that should be used by
3411                                                       the kernel by default is
3412                                                       passed in the kernarg.
3413
3414                                                     "hidden_completion_action"
                                                       A global address space pointer
                                                       that helps link enqueued kernels
                                                       into the ancestor tree, used to
                                                       determine when the parent kernel
                                                       has finished, is passed in the
                                                       kernarg.
3419
3420                                                     "hidden_multigrid_sync_arg"
3421                                                       A global address space pointer for
3422                                                       multi-grid synchronization is
3423                                                       passed in the kernarg.
3424
     ".value_type"          string                   Unused and deprecated. This should no longer
                                                     be emitted, but is accepted for compatibility.
3427
3428     ".pointee_align"       integer                  Alignment in bytes of pointee
3429                                                     type for pointer type kernel
3430                                                     argument. Must be a power
3431                                                     of 2. Only present if
3432                                                     ".value_kind" is
3433                                                     "dynamic_shared_pointer".
3434     ".address_space"       string                   Kernel argument address space
3435                                                     qualifier. Only present if
3436                                                     ".value_kind" is "global_buffer" or
3437                                                     "dynamic_shared_pointer". Values
3438                                                     are:
3439
3440                                                     - "private"
3441                                                     - "global"
3442                                                     - "constant"
3443                                                     - "local"
3444                                                     - "generic"
3445                                                     - "region"
3446
3447                                                     .. TODO::
3448
3449                                                        Is "global_buffer" only "global"
3450                                                        or "constant"? Is
3451                                                        "dynamic_shared_pointer" always
3452                                                        "local"? Can HCC allow "generic"?
3453                                                        How can "private" or "region"
3454                                                        ever happen?
3455
3456     ".access"              string                   Kernel argument access
3457                                                     qualifier. Only present if
3458                                                     ".value_kind" is "image" or
3459                                                     "pipe". Values
3460                                                     are:
3461
3462                                                     - "read_only"
3463                                                     - "write_only"
3464                                                     - "read_write"
3465
3466                                                     .. TODO::
3467
3468                                                        Does this apply to
3469                                                        "global_buffer"?
3470
3471     ".actual_access"       string                   The actual memory accesses
3472                                                     performed by the kernel on the
3473                                                     kernel argument. Only present if
3474                                                     ".value_kind" is "global_buffer",
3475                                                     "image", or "pipe". This may be
3476                                                     more restrictive than indicated
3477                                                     by ".access" to reflect what the
                                                     kernel actually does. If not
                                                     present then the runtime must
                                                     assume what is implied by
                                                     ".access" and ".is_const". Values
3482                                                     are:
3483
3484                                                     - "read_only"
3485                                                     - "write_only"
3486                                                     - "read_write"
3487
3488     ".is_const"            boolean                  Indicates if the kernel argument
3489                                                     is const qualified. Only present
3490                                                     if ".value_kind" is
3491                                                     "global_buffer".
3492
3493     ".is_restrict"         boolean                  Indicates if the kernel argument
3494                                                     is restrict qualified. Only
3495                                                     present if ".value_kind" is
3496                                                     "global_buffer".
3497
3498     ".is_volatile"         boolean                  Indicates if the kernel argument
3499                                                     is volatile qualified. Only
3500                                                     present if ".value_kind" is
3501                                                     "global_buffer".
3502
3503     ".is_pipe"             boolean                  Indicates if the kernel argument
3504                                                     is pipe qualified. Only present
3505                                                     if ".value_kind" is "pipe".
3506
3507                                                     .. TODO::
3508
3509                                                        Can "global_buffer" be pipe
3510                                                        qualified?
3511
3512     ====================== ============== ========= ================================
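
The presence rules in the table above can be checked mechanically. The
following is a minimal Python sketch under the assumption that one kernel
argument metadata map has already been decoded into a ``dict``. The function
name ``check_kernel_arg`` and the sample maps are hypothetical; the keys and
the ``.value_kind`` conditions are transcribed from the table.

.. code-block:: python

  REQUIRED_KEYS = (".size", ".offset", ".value_kind")

  # Optional keys and the ".value_kind" values for which they may be present.
  CONDITIONAL_KEYS = {
      ".pointee_align": {"dynamic_shared_pointer"},
      ".address_space": {"global_buffer", "dynamic_shared_pointer"},
      ".access": {"image", "pipe"},
      ".actual_access": {"global_buffer", "image", "pipe"},
      ".is_const": {"global_buffer"},
      ".is_restrict": {"global_buffer"},
      ".is_volatile": {"global_buffer"},
      ".is_pipe": {"pipe"},
  }

  def check_kernel_arg(arg):
      """Return a list of problems found in one kernel argument map."""
      problems = []
      for key in REQUIRED_KEYS:
          if key not in arg:
              problems.append(f"missing required key {key}")
      kind = arg.get(".value_kind")
      for key, kinds in CONDITIONAL_KEYS.items():
          if key in arg and kind not in kinds:
              problems.append(f"{key} is not valid for .value_kind {kind!r}")
      align = arg.get(".pointee_align")
      # A power of 2 has exactly one bit set.
      if align is not None and (align <= 0 or align & (align - 1)):
          problems.append(".pointee_align must be a power of 2")
      return problems

  good = {".size": 8, ".offset": 0, ".value_kind": "global_buffer",
          ".address_space": "global"}
  bad = {".size": 4, ".offset": 8, ".value_kind": "by_value",
         ".is_const": True}

Here ``check_kernel_arg(good)`` returns an empty list, while
``check_kernel_arg(bad)`` reports that ``.is_const`` is only meaningful for
"global_buffer" arguments. The sketch does not check ``.offset`` alignment,
because the required alignment of an argument is not recorded in this map.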
3513
3514.. _amdgpu-amdhsa-code-object-metadata-v4:
3515
3516Code Object V4 Metadata
3517+++++++++++++++++++++++
3518
3519Code object V4 metadata is the same as
3520:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3521defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3522
3523  .. table:: AMDHSA Code Object V4 Metadata Map Changes
3524     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3525
3526     ================= ============== ========= =======================================
3527     String Key        Value Type     Required? Description
3528     ================= ============== ========= =======================================
3529     "amdhsa.version"  sequence of    Required  - The first integer is the major
3530                       2 integers                 version. Currently 1.
3531                                                - The second integer is the minor
3532                                                  version. Currently 1.
3533     "amdhsa.target"   string         Required  The target name of the code using the syntax:
3534
3535                                                .. code::
3536
3537                                                  <target-triple> [ "-" <target-id> ]
3538
3539                                                A canonical target ID must be
3540                                                used. See :ref:`amdgpu-target-triples`
3541                                                and :ref:`amdgpu-target-id`.
3542     ================= ============== ========= =======================================
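
The "amdhsa.target" syntax can be illustrated with a small Python sketch. The
helper name ``make_amdhsa_target`` and the example triple and target ID values
are for illustration only. The sketch simply joins the two parts and does not
canonicalize the target ID, which the caller is assumed to have done already.

.. code-block:: python

  def make_amdhsa_target(target_triple, target_id=None):
      """Build a string of the form <target-triple> [ "-" <target-id> ]."""
      if target_id is None:
          return target_triple
      return f"{target_triple}-{target_id}"

  # Illustrative values: a triple with an empty environment component,
  # followed by a target ID that carries two feature settings.
  target = make_amdhsa_target("amdgcn-amd-amdhsa-", "gfx908:sramecc+:xnack-")

Because the environment component of the example triple is empty, the result
contains two consecutive hyphens before the target ID.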
3543
3544.. _amdgpu-amdhsa-code-object-metadata-v5:
3545
3546Code Object V5 Metadata
3547+++++++++++++++++++++++
3548
3549.. warning::
3550  Code object V5 is not the default code object version emitted by this version
3551  of LLVM.
3552
3553
3554Code object V5 metadata is the same as
3555:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3556:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5` and table
3557:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3558
3559  .. table:: AMDHSA Code Object V5 Metadata Map Changes
3560     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3561
3562     ================= ============== ========= =======================================
3563     String Key        Value Type     Required? Description
3564     ================= ============== ========= =======================================
3565     "amdhsa.version"  sequence of    Required  - The first integer is the major
3566                       2 integers                 version. Currently 1.
3567                                                - The second integer is the minor
3568                                                  version. Currently 2.
3569     ================= ============== ========= =======================================
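
Since V4 and V5 differ in the minor number of "amdhsa.version", a consumer can
use that value to tell them apart. The following Python sketch assumes the
metadata has been decoded into a ``dict``. The names
``AMDHSA_VERSION_TO_CODE_OBJECT`` and ``code_object_version`` are hypothetical,
and only the two versions listed in the V4 and V5 tables are included.

.. code-block:: python

  # (major, minor) values of "amdhsa.version" from the V4 and V5 tables.
  AMDHSA_VERSION_TO_CODE_OBJECT = {(1, 1): 4, (1, 2): 5}

  def code_object_version(metadata):
      """Return 4 or 5, or None for a version not covered by this sketch."""
      major, minor = metadata["amdhsa.version"]
      return AMDHSA_VERSION_TO_CODE_OBJECT.get((major, minor))

  version = code_object_version({"amdhsa.version": [1, 2]})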
3570
3571..
3572
3573  .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3574     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3575
3576     ====================== ============== ========= ================================
3577     String Key             Value Type     Required? Description
3578     ====================== ============== ========= ================================
3579     ".value_kind"          string         Required  Kernel argument kind that
3580                                                     specifies how to set up the
3581                                                     corresponding argument.
                                                     Values are the same as for code
                                                     object V3 metadata (see
                                                     :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
                                                     with the following additions:
3586
3587                                                     "hidden_block_count_x"
3588                                                       The grid dispatch work-group count for the X dimension
3589                                                       is passed in the kernarg. Some languages, such as OpenCL,
3590                                                       support a last work-group in each dimension being partial.
3591                                                       This count only includes the non-partial work-group count.
3592                                                       This is not the same as the value in the AQL dispatch packet,
3593                                                       which has the grid size in work-items.
3594
3595                                                     "hidden_block_count_y"
3596                                                       The grid dispatch work-group count for the Y dimension
3597                                                       is passed in the kernarg. Some languages, such as OpenCL,
3598                                                       support a last work-group in each dimension being partial.
3599                                                       This count only includes the non-partial work-group count.
3600                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1, then this must be 1.
3603
3604                                                     "hidden_block_count_z"
3605                                                       The grid dispatch work-group count for the Z dimension
3606                                                       is passed in the kernarg. Some languages, such as OpenCL,
3607                                                       support a last work-group in each dimension being partial.
3608                                                       This count only includes the non-partial work-group count.
3609                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1 or 2, then this must be 1.
3612
3613                                                     "hidden_group_size_x"
3614                                                       The grid dispatch work-group size for the X dimension is
3615                                                       passed in the kernarg. This size only applies to the
3616                                                       non-partial work-groups. This is the same value as the AQL
3617                                                       dispatch packet work-group size.
3618
3619                                                     "hidden_group_size_y"
3620                                                       The grid dispatch work-group size for the Y dimension is
3621                                                       passed in the kernarg. This size only applies to the
3622                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1, then this must be 1.
3625
3626                                                     "hidden_group_size_z"
3627                                                       The grid dispatch work-group size for the Z dimension is
3628                                                       passed in the kernarg. This size only applies to the
3629                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1 or 2, then this must be 1.
3632
                                                     "hidden_remainder_x"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the X dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the X dimension.

                                                     "hidden_remainder_y"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the Y dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the Y dimension.

                                                     "hidden_remainder_z"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the Z dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the Z dimension.
3647
3648                                                     "hidden_grid_dims"
3649                                                       The grid dispatch dimensionality. This is the same value
3650                                                       as the AQL dispatch packet dimensionality. Must be a value
3651                                                       between 1 and 3.
3652
3653                                                     "hidden_heap_v1"
3654                                                       A global address space pointer to an initialized memory
3655                                                       buffer that conforms to the requirements of the malloc/free
3656                                                       device library V1 version implementation.
3657
3658                                                     "hidden_private_base"
3659                                                       The high 32 bits of the flat addressing private aperture base.
3660                                                       Only used by GFX8 to allow conversion between private segment
3661                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3662
3663                                                     "hidden_shared_base"
3664                                                       The high 32 bits of the flat addressing shared aperture base.
3665                                                       Only used by GFX8 to allow conversion between shared segment
3666                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3667
3668                                                     "hidden_queue_ptr"
3669                                                       A global memory address space pointer to the ROCm runtime
3670                                                       ``struct amd_queue_t`` structure for the HSA queue of the
3671                                                       associated dispatch AQL packet. It is only required for pre-GFX9
3672                                                       devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
3673
3674     ====================== ============== ========= ================================
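
The relationship between the AQL dispatch sizes and the V5 hidden arguments can
be sketched in Python as follows. The function name ``hidden_dispatch_args`` is
hypothetical. The sketch assumes that unused dimensions have a grid size and
work-group size of 1, so that the "must be 1" rules for lower dimensionalities
fall out of the same arithmetic.

.. code-block:: python

  def hidden_dispatch_args(grid_size, workgroup_size):
      """Compute V5 hidden kernarg values from per-dimension AQL sizes.

      grid_size is in work-items, as in the AQL dispatch packet. Both
      tuples hold 1 to 3 entries; unused dimensions are padded with 1.
      """
      dims = len(grid_size)
      if not 1 <= dims <= 3 or len(workgroup_size) != dims:
          raise ValueError("expected 1 to 3 matching dimensions")
      grid = tuple(grid_size) + (1,) * (3 - dims)
      group = tuple(workgroup_size) + (1,) * (3 - dims)
      args = {"hidden_grid_dims": dims}
      for axis, g, wg in zip("xyz", grid, group):
          # Only the non-partial work-groups are counted.
          args[f"hidden_block_count_{axis}"] = g // wg
          args[f"hidden_group_size_{axis}"] = wg
          # Size of the partial work-group, or 0 if there is none.
          args[f"hidden_remainder_{axis}"] = g % wg
      return args

  # A 2D dispatch of 1000 x 4 work-items with 256 x 2 work-groups.
  args = hidden_dispatch_args((1000, 4), (256, 2))

For this dispatch the X dimension has 3 full work-groups and a partial
work-group of 232 work-items, the Y dimension has 2 full work-groups and no
partial one, and the unused Z dimension has a block count and group size of 1.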
3675
3676..
3677
3678Kernel Dispatch
3679~~~~~~~~~~~~~~~
3680
3681The HSA architected queuing language (AQL) defines a user space memory interface
3682that can be used to control the dispatch of kernels, in an agent independent
3683way. An agent can have zero or more AQL queues created for it using an HSA
3684compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3685are 64 bytes) can be placed. See the *HSA Platform System Architecture
3686Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3687
3688The packet processor of a kernel agent is responsible for detecting and
3689dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3690packet processor is implemented by the hardware command processor (CP),
3691asynchronous dispatch controller (ADC) and shader processor input controller
3692(SPI).
3693
3694An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3695the kernel mode driver to initialize and register the AQL queue with CP.
3696
3697To dispatch a kernel the following actions are performed. This can occur in the
3698CPU host program, or from an HSA kernel executing on a GPU.
3699
37001. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3701   executed is obtained.
37022. A pointer to the kernel descriptor (see
3703   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3704   It must be for a kernel that is contained in a code object that was loaded
3705   by an HSA compatible runtime on the kernel agent with which the AQL queue is
3706   associated.
37073. Space is allocated for the kernel arguments using the HSA compatible runtime
3708   allocator for a memory region with the kernarg property for the kernel agent
3709   that will execute the kernel. It must be at least 16-byte aligned.
37104. Kernel argument values are assigned to the kernel argument memory
3711   allocation. The layout is defined in the *HSA Programmer's Language
3712   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3713   kernel argument memory in the same way constant memory is accessed. (Note
3714   that the HSA specification allows an implementation to copy the kernel
3715   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA Platform System Architecture Specification* [HSA]_.
37246. A kernel dispatch packet includes information about the actual dispatch,
3725   such as grid and work-group size, together with information from the code
3726   object about the kernel, such as segment sizes. The HSA compatible runtime
3727   queries on the kernel symbol can be used to obtain the code object values
3728   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
37297. CP executes micro-code and is responsible for detecting and setting up the
3730   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing to execute the machine code that corresponds to the kernel.
374010. When the kernel dispatch has completed execution, CP signals the completion
3741    signal specified in the kernel dispatch packet if not 0.
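
The ordering requirement in step 5 (reserve a slot, fill in the packet, write
the header last) can be modelled with a toy queue in Python. This is only a
single-threaded sketch: ``ToyQueue`` is hypothetical, the packet type value 2
and the field order in ``BODY`` are assumptions based on the HSA runtime's
kernel dispatch packet structure and should be verified against the HSA
specification, and plain Python assignments stand in for the atomic operations
a real runtime would use. The kernel object and kernarg addresses are made-up
values.

.. code-block:: python

  import struct

  PACKET_SIZE = 64     # all AQL packets are 64 bytes
  KERNEL_DISPATCH = 2  # assumed packet type value for the header

  # Assumed layout of the 62 bytes that follow the 16-bit header: setup,
  # work-group size x/y/z, reserved, grid size x/y/z, private and group
  # segment sizes, kernel object, kernarg address, reserved, completion signal.
  BODY = struct.Struct("<H3HH3IIIQQQQ")

  class ToyQueue:
      def __init__(self, num_packets):
          self.num_packets = num_packets
          self.ring = bytearray(num_packets * PACKET_SIZE)
          self.write_index = 0

      def reserve(self):
          # A real runtime uses a 64-bit atomic add on the write index.
          index = self.write_index
          self.write_index += 1
          return index

      def publish(self, index, header, body):
          base = (index % self.num_packets) * PACKET_SIZE
          # Fill in everything except the header first.
          self.ring[base + 2:base + PACKET_SIZE] = body
          # Final write: a real runtime uses an atomic store release here,
          # and then rings the doorbell signal.
          struct.pack_into("<H", self.ring, base, header)

  queue = ToyQueue(num_packets=4)
  slot = queue.reserve()
  body = BODY.pack(
      1,            # setup: 1 grid dimension
      256, 1, 1,    # work-group size
      0,            # reserved
      1024, 1, 1,   # grid size in work-items
      0, 0,         # private and group segment sizes
      0x1000,       # kernel descriptor address (made up)
      0x2000,       # kernarg address, 16-byte aligned (made up)
      0,            # reserved
      0,            # completion signal: 0 means none
  )
  queue.publish(slot, KERNEL_DISPATCH, body)

Writing the header last matters because the packet processor treats a slot as
ready as soon as its header holds a valid packet type, so every other field
must already be visible at that point.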
3742
3743.. _amdgpu-amdhsa-memory-spaces:
3744
3745Memory Spaces
3746~~~~~~~~~~~~~
3747
3748The memory space properties are:
3749
3750  .. table:: AMDHSA Memory Spaces
3751     :name: amdgpu-amdhsa-memory-spaces-table
3752
3753     ================= =========== ======== ======= ==================
3754     Memory Space Name HSA Segment Hardware Address NULL Value
3755                       Name        Name     Size
3756     ================= =========== ======== ======= ==================
3757     Private           private     scratch  32      0x00000000
3758     Local             group       LDS      32      0xFFFFFFFF
3759     Global            global      global   64      0x0000000000000000
3760     Constant          constant    *same as 64      0x0000000000000000
3761                                   global*
3762     Generic           flat        flat     64      0x0000000000000000
3763     Region            N/A         GDS      32      *not implemented
3764                                                    for AMDHSA*
3765     ================= =========== ======== ======= ==================
3766
3767The global and constant memory spaces both use global virtual addresses, which
3768are the same virtual address space used by the CPU. However, some virtual
3769addresses may only be accessible to the CPU, some only accessible by the GPU,
3770and some by both.
3771
3772Using the constant memory space indicates that the data will not change during
3773the execution of the kernel. This allows scalar read instructions to be
3774used. The vector and scalar L1 caches are invalidated of volatile data before
3775each kernel dispatch execution to allow constant memory to change values between
3776kernel dispatches.
3777
3778The local memory space uses the hardware Local Data Store (LDS) which is
3779automatically allocated when the hardware creates work-groups of wavefronts, and
3780freed when all the wavefronts of a work-group have terminated. The data store
3781(DS) instructions can be used to access it.
3782
3783The private memory space uses the hardware scratch memory support. If the kernel
3784uses scratch, then the hardware allocates memory that is accessed using
3785wavefront lane dword (4 byte) interleaving. The mapping used from private
3786address to physical address is:
3787
  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``
3791
3792There are different ways that the wavefront scratch base address is determined
3793by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
3795the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3796instructions, or by flat instructions. If each lane of a wavefront accesses the
3797same private address, the interleaving results in adjacent dwords being accessed
3798and hence requires fewer cache lines to be fetched. Multi-dword access is not
3799supported except by flat and scratch instructions in GFX9-GFX11.
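
The dword interleaving described above can be sketched in Python. This is an
illustrative model only, not code from the backend: the function name and the
example base address are hypothetical, and the sketch assumes that
``private-address`` is a byte address that is split into a dword index and a
byte-within-dword remainder.

.. code-block:: python

  def private_to_physical(scratch_base, private_address, lane_id,
                          wavefront_size=64):
      """Map a per-lane private byte address to a scratch byte address.

      Assumption of this sketch: consecutive dwords of one lane are
      separated by one dword for every lane of the wavefront.
      """
      dword_index, byte_in_dword = divmod(private_address, 4)
      return (scratch_base
              + dword_index * wavefront_size * 4
              + lane_id * 4
              + byte_in_dword)


  # Every lane of a 64-lane wavefront accesses private address 8.
  # 0x1000 is a made-up wavefront scratch base.
  addresses = [private_to_physical(0x1000, 8, lane) for lane in range(64)]

Under these assumptions the 64 lanes touch one contiguous 256-byte range, which
illustrates why a uniform private address needs fewer cache lines.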
3800
3801The generic address space uses the hardware flat address support available in
3802GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures), which are outside the range of addressable global memory, to
3804map from a flat address to a private or local address.
3805
FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on whether the address is within one of the
3808aperture ranges. Flat access to scratch requires hardware aperture setup and
3809setup in the kernel prologue (see
3810:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3811hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3812:ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3813
To convert between a segment address and a flat address, the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
For GFX9-GFX11 the aperture base addresses are directly available as inline
constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In
64-bit address mode the aperture sizes are 2^32 bytes and the base is aligned to
2^32, which makes it easier to convert from flat to segment or segment to flat.
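
Because the apertures are 2^32 bytes in size and aligned to 2^32, the conversion
reduces to operations on the upper and lower 32 bits of the flat address. The
following Python sketch illustrates this for 64-bit address mode. The function
names and the example aperture base are hypothetical, and the sketch
deliberately ignores the translation of NULL values, which differ between memory
spaces (see :ref:`amdgpu-amdhsa-memory-spaces-table`).

.. code-block:: python

  APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode

  # Made-up aperture base, for illustration only. Real values come from the
  # AQL queue (GFX7-GFX8) or SRC_SHARED_BASE/SRC_PRIVATE_BASE (GFX9-GFX11).
  EXAMPLE_SHARED_BASE = 0x0001_0000_0000_0000


  def segment_to_flat(aperture_base, segment_address):
      """Place a 32-bit segment address inside the given aperture."""
      if aperture_base % APERTURE_SIZE != 0:
          raise ValueError("aperture base must be aligned to 2^32")
      if not 0 <= segment_address < APERTURE_SIZE:
          raise ValueError("segment address must fit in 32 bits")
      return aperture_base | segment_address


  def is_in_aperture(aperture_base, flat_address):
      """The upper 32 bits of a flat address select the aperture."""
      return (flat_address >> 32) == (aperture_base >> 32)


  def flat_to_segment(flat_address):
      """The lower 32 bits of a flat address are the segment address."""
      return flat_address & (APERTURE_SIZE - 1)


  flat = segment_to_flat(EXAMPLE_SHARED_BASE, 0x40)

A flat address that falls in neither aperture is treated as a global address and
needs no conversion.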
3822
3823Image and Samplers
3824~~~~~~~~~~~~~~~~~~
3825
Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations,
3829two extra dwords are used to store the HSA BRIG enumeration values for the
3830queries that are not trivially deducible from the S# representation.
3831
3832HSA Signals
3833~~~~~~~~~~~
3834
3835HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3836are 64-bit addresses of a structure allocated in memory accessible from both the
3837CPU and GPU. The structure is defined by the runtime and subject to change
3838between releases. For example, see [AMD-ROCm-github]_.
3839
3840.. _amdgpu-amdhsa-hsa-aql-queue:
3841
3842HSA AQL Queue
3843~~~~~~~~~~~~~
3844
3845The HSA AQL queue structure is defined by an HSA compatible runtime (see
3846:ref:`amdgpu-os`) and subject to change between releases. For example, see
3847[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP, such as those for managing the allocation of scratch
memory.
3850
3851.. _amdgpu-amdhsa-kernel-descriptor:
3852
3853Kernel Descriptor
3854~~~~~~~~~~~~~~~~~
3855
3856A kernel descriptor consists of the information needed by CP to initiate the
3857execution of a kernel, including the entry point address of the machine code
3858that implements the kernel.
3859
3860Code Object V3 Kernel Descriptor
3861++++++++++++++++++++++++++++++++
3862
CP microcode requires the kernel descriptor to be allocated on 64-byte
3864alignment.
3865
3866The fields used by CP for code objects before V3 also match those specified in
3867:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3868
3869  .. table:: Code Object V3 Kernel Descriptor
3870     :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3871
3872     ======= ======= =============================== ============================
3873     Bits    Size    Field Name                      Description
3874     ======= ======= =============================== ============================
3875     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3876                                                     address space memory
3877                                                     required for a work-group
3878                                                     in bytes. This does not
3879                                                     include any dynamically
3880                                                     allocated local address
3881                                                     space memory that may be
3882                                                     added when the kernel is
3883                                                     dispatched.
3884     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3885                                                     private address space
3886                                                     memory required for a
3887                                                     work-item in bytes.
3888                                                     Additional space may need to
3889                                                     be added to this value if
3890                                                     the call stack has
3891                                                     non-inlined function calls.
3892     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3893                                                     memory pointed to by the
3894                                                     AQL dispatch packet. The
3895                                                     kernarg memory is used to
3896                                                     pass arguments to the
3897                                                     kernel.
3898
3899                                                     * If the kernarg pointer in
3900                                                       the dispatch packet is NULL
3901                                                       then there are no kernel
3902                                                       arguments.
3903                                                     * If the kernarg pointer in
3904                                                       the dispatch packet is
3905                                                       not NULL and this value
3906                                                       is 0 then the kernarg
3907                                                       memory size is
3908                                                       unspecified.
3909                                                     * If the kernarg pointer in
3910                                                       the dispatch packet is
3911                                                       not NULL and this value
3912                                                       is not 0 then the value
3913                                                       specifies the kernarg
3914                                                       memory size in bytes. It
3915                                                       is recommended to provide
3916                                                       a value as it may be used
3917                                                       by CP to optimize making
3918                                                       the kernarg memory
3919                                                       visible to the kernel
3920                                                       code.
3921
3922     127:96  4 bytes                                 Reserved, must be 0.
3923     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3924                                                     negative) from base
3925                                                     address of kernel
3926                                                     descriptor to kernel's
3927                                                     entry point instruction
3928                                                     which must be 256 byte
3929                                                     aligned.
     351:192 20                                      Reserved, must be 0.
3931             bytes
3932     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
3933                                                       Reserved, must be 0.
3934                                                     GFX90A, GFX940
3935                                                       Compute Shader (CS)
3936                                                       program settings used by
3937                                                       CP to set up
3938                                                       ``COMPUTE_PGM_RSRC3``
3939                                                       configuration
3940                                                       register. See
3941                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3942                                                     GFX10-GFX11
3943                                                       Compute Shader (CS)
3944                                                       program settings used by
3945                                                       CP to set up
3946                                                       ``COMPUTE_PGM_RSRC3``
3947                                                       configuration
3948                                                       register. See
3949                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
3950     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3951                                                     program settings used by
3952                                                     CP to set up
3953                                                     ``COMPUTE_PGM_RSRC1``
3954                                                     configuration
3955                                                     register. See
3956                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
3957     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
3958                                                     program settings used by
3959                                                     CP to set up
3960                                                     ``COMPUTE_PGM_RSRC2``
3961                                                     configuration
3962                                                     register. See
3963                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
3965                                                     SGPR user data registers
3966                                                     (see
3967                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3968
3969                                                     The total number of SGPR
3970                                                     user data registers
3971                                                     requested must not exceed
                                                     16 and must match the value in
3973                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3974                                                     Any requests beyond 16
3975                                                     will be ignored.
3976     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
3977                     _BUFFER                         column of
3978                                                     :ref:`amdgpu-processor-table`
3979                                                     specifies *Architected flat
3980                                                     scratch* then not supported
                                                     and must be 0.
3982     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
3983     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
3984     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
3985     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
3986     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
3987                                                     column of
3988                                                     :ref:`amdgpu-processor-table`
3989                                                     specifies *Architected flat
3990                                                     scratch* then not supported
                                                     and must be 0.
3992     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
3993                     _SIZE
3994     457:455 3 bits                                  Reserved, must be 0.
3995     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
3996                                                       Reserved, must be 0.
3997                                                     GFX10-GFX11
3998                                                       - If 0 execute in
3999                                                         wavefront size 64 mode.
4000                                                       - If 1 execute in
4001                                                         native wavefront size
4002                                                         32 mode.
4003     459     1 bit   USES_DYNAMIC_STACK              Indicates if the generated
4004                                                     machine code is using a
4005                                                     dynamically sized stack.
     463:460 4 bits                                  Reserved, must be 0.
4007     464     1 bit   RESERVED_464                    Deprecated, must be 0.
4008     467:465 3 bits                                  Reserved, must be 0.
4009     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
4011     511:472 5 bytes                                 Reserved, must be 0.
4012     512     **Total size 64 bytes.**
4013     ======= ====================================================================
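
The layout in the table above can be checked by packing the fields into 64 bytes.
The following Python sketch does so with the standard ``struct`` module. It is
an illustration only, not the way the assembler emits the descriptor: the
function name is hypothetical, little-endian byte order is assumed, the three
``COMPUTE_PGM_RSRC`` words are passed as opaque placeholders, and bits 463:448
are treated as one 16-bit field so that each flag is a bit position relative to
bit 448.

.. code-block:: python

  import struct

  # Bit masks within the 16-bit field that covers descriptor bits 463:448.
  ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER = 1 << 0   # bit 448
  ENABLE_SGPR_DISPATCH_PTR = 1 << 1             # bit 449
  ENABLE_SGPR_QUEUE_PTR = 1 << 2                # bit 450
  ENABLE_SGPR_KERNARG_SEGMENT_PTR = 1 << 3      # bit 451
  ENABLE_SGPR_DISPATCH_ID = 1 << 4              # bit 452
  ENABLE_SGPR_FLAT_SCRATCH_INIT = 1 << 5        # bit 453
  ENABLE_SGPR_PRIVATE_SEGMENT_SIZE = 1 << 6     # bit 454
  ENABLE_WAVEFRONT_SIZE32 = 1 << 10             # bit 458
  USES_DYNAMIC_STACK = 1 << 11                  # bit 459


  def pack_kernel_descriptor(group_size, private_size, kernarg_size,
                             entry_offset, rsrc3, rsrc1, rsrc2, properties):
      """Pack the descriptor fields in table order into 64 bytes."""
      return struct.pack(
          "<IIIIq20sIIIH6s",
          group_size,      # bits 31:0
          private_size,    # bits 63:32
          kernarg_size,    # bits 95:64
          0,               # bits 127:96, reserved
          entry_offset,    # bits 191:128, signed byte offset
          b"",             # 20 reserved bytes, zero filled by struct
          rsrc3,           # bits 383:352
          rsrc1,           # bits 415:384
          rsrc2,           # bits 447:416
          properties,      # bits 463:448
          b"",             # bits 511:464, reserved
      )


  # Placeholder values; the zero RSRC words are not meaningful settings.
  descriptor = pack_kernel_descriptor(
      group_size=1024, private_size=0, kernarg_size=24, entry_offset=256,
      rsrc3=0, rsrc1=0, rsrc2=0,
      properties=(ENABLE_SGPR_DISPATCH_PTR
                  | ENABLE_SGPR_KERNARG_SEGMENT_PTR
                  | ENABLE_WAVEFRONT_SIZE32))

The packed result is exactly 64 bytes long, which matches the total size stated
in the table.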
4014
4015..
4016
4017  .. table:: compute_pgm_rsrc1 for GFX6-GFX11
4018     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table
4019
4020     ======= ======= =============================== ===========================================================================
4021     Bits    Size    Field Name                      Description
4022     ======= ======= =============================== ===========================================================================
4023     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
4024                                                     blocks used by each work-item;
4025                                                     granularity is device
4026                                                     specific:
4027
4028                                                     GFX6-GFX9
4029                                                       - vgprs_used 0..256
4030                                                       - max(0, ceil(vgprs_used / 4) - 1)
4031                                                     GFX90A, GFX940
4032                                                       - vgprs_used 0..512
4033                                                       - vgprs_used = align(arch_vgprs, 4)
4034                                                                      + acc_vgprs
4035                                                       - max(0, ceil(vgprs_used / 8) - 1)
                                                     GFX10-GFX11 (wavefront size 64)
                                                       - vgprs_used 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10-GFX11 (wavefront size 32)
                                                       - vgprs_used 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)
4042
4043                                                     Where vgprs_used is defined
4044                                                     as the highest VGPR number
4045                                                     explicitly referenced plus
4046                                                     one.
4047
4048                                                     Used by CP to set up
4049                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
4050
4051                                                     The
4052                                                     :ref:`amdgpu-assembler`
4053                                                     calculates this
4054                                                     automatically for the
4055                                                     selected processor from
4056                                                     values provided to the
4057                                                     `.amdhsa_kernel` directive
4058                                                     by the
4059                                                     `.amdhsa_next_free_vgpr`
4060                                                     nested directive (see
4061                                                     :ref:`amdhsa-kernel-directives-table`).
4062     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4063                                                     blocks used by a wavefront;
4064                                                     granularity is device
4065                                                     specific:
4066
4067                                                     GFX6-GFX8
4068                                                       - sgprs_used 0..112
4069                                                       - max(0, ceil(sgprs_used / 8) - 1)
4070                                                     GFX9
4071                                                       - sgprs_used 0..112
4072                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
4073                                                     GFX10-GFX11
4074                                                       Reserved, must be 0.
4075                                                       (128 SGPRs always
4076                                                       allocated.)
4077
4078                                                     Where sgprs_used is
4079                                                     defined as the highest
4080                                                     SGPR number explicitly
4081                                                     referenced plus one, plus
4082                                                     a target specific number
4083                                                     of additional special
4084                                                     SGPRs for VCC,
4085                                                     FLAT_SCRATCH (GFX7+) and
4086                                                     XNACK_MASK (GFX8+), and
4087                                                     any additional
4088                                                     target specific
4089                                                     limitations. It does not
4090                                                     include the 16 SGPRs added
4091                                                     if a trap handler is
4092                                                     enabled.
4093
4094                                                     The target specific
4095                                                     limitations and special
4096                                                     SGPR layout are defined in
4097                                                     the hardware
4098                                                     documentation, which can
4099                                                     be found in the
4100                                                     :ref:`amdgpu-processors`
4101                                                     table.
4102
4103                                                     Used by CP to set up
4104                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
4105
4106                                                     The
4107                                                     :ref:`amdgpu-assembler`
4108                                                     calculates this
4109                                                     automatically for the
4110                                                     selected processor from
4111                                                     values provided to the
4112                                                     `.amdhsa_kernel` directive
4113                                                     by the
4114                                                     `.amdhsa_next_free_sgpr`
4115                                                     and `.amdhsa_reserve_*`
4116                                                     nested directives (see
4117                                                     :ref:`amdhsa-kernel-directives-table`).
4118     11:10   2 bits  PRIORITY                        Must be 0.
4119
4120                                                     Start executing wavefront
4121                                                     at the specified priority.
4122
4123                                                     CP is responsible for
4124                                                     filling in
4125                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
4126     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
4127                                                     with specified rounding
                                                     mode for single (32-bit)
                                                     precision floating point
                                                     operations.
4132
4133                                                     Floating point rounding
4134                                                     mode values are defined in
4135                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4136
4137                                                     Used by CP to set up
4138                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4139     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) precision
                                                     floating point operations.
4145
4146                                                     Floating point rounding
4147                                                     mode values are defined in
4148                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4149
4150                                                     Used by CP to set up
4151                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4152     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
4153                                                     with specified denorm mode
                                                     for single (32-bit)
                                                     precision floating point
                                                     operations.
4158
4159                                                     Floating point denorm mode
4160                                                     values are defined in
4161                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4162
4163                                                     Used by CP to set up
4164                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4165     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
4166                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) precision
                                                     floating point operations.
4171
4172                                                     Floating point denorm mode
4173                                                     values are defined in
4174                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4175
4176                                                     Used by CP to set up
4177                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4178     20      1 bit   PRIV                            Must be 0.
4179
4180                                                     Start executing wavefront
                                                     in privileged trap handler
4182                                                     mode.
4183
4184                                                     CP is responsible for
4185                                                     filling in
4186                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
4187     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
4188                                                     with DX10 clamp mode
4189                                                     enabled. Used by the vector
4190                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
4192                                                     set, clamp NaN to zero,
4193                                                     otherwise pass NaN
4194                                                     through).
4195
4196                                                     Used by CP to set up
4197                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4198     22      1 bit   DEBUG_MODE                      Must be 0.
4199
4200                                                     Start executing wavefront
4201                                                     in single step mode.
4202
4203                                                     CP is responsible for
4204                                                     filling in
4205                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4206     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
4207                                                     with IEEE mode
4208                                                     enabled. Floating point
4209                                                     opcodes that support
4210                                                     exception flag gathering
4211                                                     will quiet and propagate
4212                                                     signaling-NaN inputs per
4213                                                     IEEE 754-2008. Min_dx10 and
4214                                                     max_dx10 become IEEE
4215                                                     754-2008 compliant due to
4216                                                     signaling-NaN propagation
4217                                                     and quieting.
4218
4219                                                     Used by CP to set up
4220                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4221     24      1 bit   BULKY                           Must be 0.
4222
4223                                                     Only one work-group allowed
4224                                                     to execute on a compute
4225                                                     unit.
4226
4227                                                     CP is responsible for
4228                                                     filling in
4229                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
4230     25      1 bit   CDBG_USER                       Must be 0.
4231
4232                                                     Flag that can be used to
4233                                                     control debugging code.
4234
4235                                                     CP is responsible for
4236                                                     filling in
4237                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4238     26      1 bit   FP16_OVFL                       GFX6-GFX8
4239                                                       Reserved, must be 0.
4240                                                     GFX9-GFX11
4241                                                       Wavefront starts execution
4242                                                       with specified fp16 overflow
4243                                                       mode.
4244
4245                                                       - If 0, fp16 overflow generates
4246                                                         +/-INF values.
4247                                                       - If 1, fp16 overflow that is the
4248                                                         result of an +/-INF input value
4249                                                         or divide by 0 produces a +/-INF,
4250                                                         otherwise clamps computed
4251                                                         overflow to +/-MAX_FP16 as
4252                                                         appropriate.
4253
4254                                                       Used by CP to set up
4255                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4256     28:27   2 bits                                  Reserved, must be 0.
4257     29      1 bit    WGP_MODE                       GFX6-GFX9
4258                                                       Reserved, must be 0.
4259                                                     GFX10-GFX11
4260                                                       - If 0 execute work-groups in
4261                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups in
                                                         WGP wavefront execution mode.
4264
4265                                                       See :ref:`amdgpu-amdhsa-memory-model`.
4266
4267                                                       Used by CP to set up
4268                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4269     30      1 bit    MEM_ORDERED                    GFX6-GFX9
4270                                                       Reserved, must be 0.
4271                                                     GFX10-GFX11
4272                                                       Controls the behavior of the
4273                                                       s_waitcnt's vmcnt and vscnt
4274                                                       counters.
4275
4276                                                       - If 0 vmcnt reports completion
4277                                                         of load and atomic with return
4278                                                         out of order with sample
4279                                                         instructions, and the vscnt
4280                                                         reports the completion of
4281                                                         store and atomic without
4282                                                         return in order.
4283                                                       - If 1 vmcnt reports completion
4284                                                         of load, atomic with return
4285                                                         and sample instructions in
4286                                                         order, and the vscnt reports
4287                                                         the completion of store and
4288                                                         atomic without return in order.
4289
4290                                                       Used by CP to set up
4291                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4292     31      1 bit    FWD_PROGRESS                   GFX6-GFX9
4293                                                       Reserved, must be 0.
4294                                                     GFX10-GFX11
4295                                                       - If 0 execute SIMD wavefronts
4296                                                         using oldest first policy.
4297                                                       - If 1 execute SIMD wavefronts to
4298                                                         ensure wavefronts will make some
4299                                                         forward progress.
4300
4301                                                       Used by CP to set up
4302                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32      **Total size 4 bytes.**
4304     ======= ===================================================================================================================
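
As an illustration of the bit layout above, the following Python sketch
extracts bits 22 through 31 of a ``compute_pgm_rsrc1`` value. The names
``RSRC1_UPPER_FIELDS``, ``decode_rsrc1_upper`` and
``nonzero_must_be_zero_fields`` are hypothetical helpers chosen for this
example and are not part of any LLVM or runtime API. The sketch does not model
the per-target rules under which ``FP16_OVFL``, ``WGP_MODE``, ``MEM_ORDERED``
and ``FWD_PROGRESS`` are reserved.

.. code-block:: python

  # (shift, width) pairs copied from the table above; only bits 22-31.
  RSRC1_UPPER_FIELDS = {
      "DEBUG_MODE": (22, 1),
      "ENABLE_IEEE_MODE": (23, 1),
      "BULKY": (24, 1),
      "CDBG_USER": (25, 1),
      "FP16_OVFL": (26, 1),
      "RESERVED_28_27": (27, 2),
      "WGP_MODE": (29, 1),
      "MEM_ORDERED": (30, 1),
      "FWD_PROGRESS": (31, 1),
  }

  # Fields the table marks as "Must be 0" or reserved for all targets.
  MUST_BE_ZERO = ("DEBUG_MODE", "BULKY", "CDBG_USER", "RESERVED_28_27")


  def decode_rsrc1_upper(word):
      """Return field name -> value for bits 22-31 of a 32-bit word."""
      if not 0 <= word < (1 << 32):
          raise ValueError("compute_pgm_rsrc1 is a 32-bit value")
      return {
          name: (word >> shift) & ((1 << width) - 1)
          for name, (shift, width) in RSRC1_UPPER_FIELDS.items()
      }


  def nonzero_must_be_zero_fields(word):
      """List the must-be-zero fields that are set in ``word``."""
      fields = decode_rsrc1_upper(word)
      return [name for name in MUST_BE_ZERO if fields[name] != 0]


  # Example: IEEE mode and MEM_ORDERED set, everything else clear.
  fields = decode_rsrc1_upper((1 << 23) | (1 << 30))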
4305
4306..
4307
4308  .. table:: compute_pgm_rsrc2 for GFX6-GFX11
4309     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table
4310
4311     ======= ======= =============================== ===========================================================================
4312     Bits    Size    Field Name                      Description
4313     ======= ======= =============================== ===========================================================================
4314     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4315                                                       private segment.
4316                                                     * If the *Target Properties*
4317                                                       column of
4318                                                       :ref:`amdgpu-processor-table`
4319                                                       does not specify
4320                                                       *Architected flat
4321                                                       scratch* then enable the
4322                                                       setup of the SGPR
4323                                                       wavefront scratch offset
4324                                                       system register (see
4325                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4326                                                     * If the *Target Properties*
4327                                                       column of
4328                                                       :ref:`amdgpu-processor-table`
4329                                                       specifies *Architected
4330                                                       flat scratch* then enable
4331                                                       the setup of the
4332                                                       FLAT_SCRATCH register
4333                                                       pair (see
4334                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4335
4336                                                     Used by CP to set up
4337                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4338     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4339                                                     user data
4340                                                     registers requested. This
4341                                                     number must be greater than
4342                                                     or equal to the number of user
4343                                                     data registers enabled.
4344
4345                                                     Used by CP to set up
4346                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4347     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4348
4349                                                     This bit represents
4350                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4351                                                     which is set by the CP if
4352                                                     the runtime has installed a
4353                                                     trap handler.
4354     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4355                                                     system SGPR register for
4356                                                     the work-group id in the X
4357                                                     dimension (see
4358                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4359
4360                                                     Used by CP to set up
4361                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4362     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4363                                                     system SGPR register for
4364                                                     the work-group id in the Y
4365                                                     dimension (see
4366                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4367
4368                                                     Used by CP to set up
4369                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4370     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4371                                                     system SGPR register for
4372                                                     the work-group id in the Z
4373                                                     dimension (see
4374                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4375
4376                                                     Used by CP to set up
4377                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4378     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4379                                                     system SGPR register for
4380                                                     work-group information (see
4381                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4382
4383                                                     Used by CP to set up
4384                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4385     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4386                                                     VGPR system registers used
4387                                                     for the work-item ID.
4388                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4389                                                     defines the values.
4390
4391                                                     Used by CP to set up
4392                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4393     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4394
4395                                                     Wavefront starts execution
4396                                                     with address watch
4397                                                     exceptions enabled which
4398                                                     are generated when L1 has
4399                                                     witnessed a thread access
4400                                                     an *address of
4401                                                     interest*.
4402
4403                                                     CP is responsible for
4404                                                     filling in the address
4405                                                     watch bit in
4406                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4407                                                     according to what the
4408                                                     runtime requests.
4409     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4410
4411                                                     Wavefront starts execution
4412                                                     with memory violation
                                                     exceptions
4414                                                     enabled which are generated
4415                                                     when a memory violation has
4416                                                     occurred for this wavefront from
4417                                                     L1 or LDS
4418                                                     (write-to-read-only-memory,
4419                                                     mis-aligned atomic, LDS
4420                                                     address out of range,
4421                                                     illegal address, etc.).
4422
4423                                                     CP sets the memory
4424                                                     violation bit in
4425                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4426                                                     according to what the
4427                                                     runtime requests.
4428     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4429
4430                                                     CP uses the rounded value
4431                                                     from the dispatch packet,
4432                                                     not this value, as the
4433                                                     dispatch may contain
4434                                                     dynamically allocated group
4435                                                     segment memory. CP writes
4436                                                     directly to
4437                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4438
4439                                                     Amount of group segment
4440                                                     (LDS) to allocate for each
4441                                                     work-group. Granularity is
4442                                                     device specific:
4443
4444                                                     GFX6
4445                                                       roundup(lds-size / (64 * 4))
4446                                                     GFX7-GFX11
4447                                                       roundup(lds-size / (128 * 4))
4448
4449     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4450                     _INVALID_OPERATION              with specified exceptions
4451                                                     enabled.
4452
4453                                                     Used by CP to set up
4454                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
4455                                                     (set from bits 0..6).
4456
4457                                                     IEEE 754 FP Invalid
4458                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number.
4462     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4463                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4465                     _OVERFLOW
4466     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4467                     _UNDERFLOW
4468     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4469                     _INEXACT
4470     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4471                     _ZERO                           (rcp_iflag_f32 instruction
4472                                                     only)
4473     31      1 bit                                   Reserved, must be 0.
4474     32      **Total size 4 bytes.**
4475     ======= ===================================================================================================================
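
The following Python sketch shows one way to pack some of the
``compute_pgm_rsrc2`` fields listed above, and how the ``GRANULATED_LDS_SIZE``
rounding formula works out numerically. The names ``RSRC2_FIELDS``,
``pack_rsrc2`` and ``granulated_lds_size``, and the ``gfx_major`` parameter,
are assumptions made for this example only. Note that the kernel descriptor
itself must leave ``GRANULATED_LDS_SIZE`` as 0; the rounding function only
illustrates the granularity described in the table.

.. code-block:: python

  # (shift, width) pairs copied from the table above (subset of fields).
  RSRC2_FIELDS = {
      "ENABLE_PRIVATE_SEGMENT": (0, 1),
      "USER_SGPR_COUNT": (1, 5),
      "ENABLE_TRAP_HANDLER": (6, 1),
      "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
      "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
      "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
      "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
      "ENABLE_VGPR_WORKITEM_ID": (11, 2),
      "GRANULATED_LDS_SIZE": (15, 9),
  }


  def pack_rsrc2(**values):
      """Pack the given field values into a 32-bit compute_pgm_rsrc2 word."""
      word = 0
      for name, value in values.items():
          shift, width = RSRC2_FIELDS[name]
          if not 0 <= value < (1 << width):
              raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
          word |= value << shift
      return word


  def granulated_lds_size(lds_size_bytes, gfx_major):
      """roundup(lds-size / (N * 4)) with N = 64 on GFX6, 128 on GFX7-GFX11."""
      if not 6 <= gfx_major <= 11:
          raise ValueError("table covers GFX6-GFX11 only")
      granule = (64 if gfx_major == 6 else 128) * 4
      return -(-lds_size_bytes // granule)  # ceiling division


  word = pack_rsrc2(
      ENABLE_PRIVATE_SEGMENT=1,
      USER_SGPR_COUNT=6,
      ENABLE_SGPR_WORKGROUP_ID_X=1,
      ENABLE_VGPR_WORKITEM_ID=2,
  )

With these inputs ``word`` is ``0x108D``, and 1000 bytes of LDS round up to 2
granules on GFX9 but 4 granules on GFX6.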
4476
4477..
4478
4479  .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4480     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4481
4482     ======= ======= =============================== ===========================================================================
4483     Bits    Size    Field Name                      Description
4484     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
4486                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4487                                                     63 - accum-offset = 256.
     15:6    10 bits                                 Reserved, must be 0.
4490     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
4491                                                       launched in the same CU.
4492                                                     - If 1 the waves of a work-group can be
4493                                                       launched in different CUs. The waves
4494                                                       cannot use S_BARRIER or LDS.
     31:17   15 bits                                 Reserved, must be 0.
4497     32      **Total size 4 bytes.**
4498     ======= ===================================================================================================================
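
The ``ACCUM_OFFSET`` encoding above maps field value ``n`` to an offset of
``(n + 1) * 4``. A minimal Python sketch of that mapping follows; the function
names are hypothetical and exist only for this example.

.. code-block:: python

  def encode_accum_offset(accum_offset):
      """Map an AccVGPR offset (4, 8, ..., 256) to the 6-bit field value."""
      if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
          raise ValueError("accum-offset must be a multiple of 4 in [4, 256]")
      return accum_offset // 4 - 1


  def decode_accum_offset(field):
      """Map the 6-bit field value (0-63) back to the AccVGPR offset."""
      if not 0 <= field <= 63:
          raise ValueError("ACCUM_OFFSET is a 6-bit field")
      return (field + 1) * 4


  def pack_rsrc3_gfx90a(accum_offset, tg_split):
      """Pack ACCUM_OFFSET (bits 5:0) and TG_SPLIT (bit 16); rest stays 0."""
      return encode_accum_offset(accum_offset) | (int(bool(tg_split)) << 16)


  word = pack_rsrc3_gfx90a(accum_offset=8, tg_split=True)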
4499
4500..
4501
4502  .. table:: compute_pgm_rsrc3 for GFX10-GFX11
4503     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
4504
4505     ======= ======= =============================== ===========================================================================
4506     Bits    Size    Field Name                      Description
4507     ======= ======= =============================== ===========================================================================
4508     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
4509                                                     wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4510                                                     of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4511                                                     not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4512     9:4     6 bits  INST_PREF_SIZE                  GFX10
4513                                                       Reserved, must be 0.
4514                                                     GFX11
4515                                                       Number of instruction bytes to prefetch, starting at the kernel's entry
4516                                                       point instruction, before wavefront starts execution. The value is 0..63
4517                                                       with a granularity of 128 bytes.
4518     10      1 bit   TRAP_ON_START                   GFX10
4519                                                       Reserved, must be 0.
4520                                                     GFX11
4521                                                       Must be 0.
4522
4523                                                       If 1, wavefront starts execution by trapping into the trap handler.
4524
4525                                                       CP is responsible for filling in the trap on start bit in
4526                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
4527                                                       requests.
4528     11      1 bit   TRAP_ON_END                     GFX10
4529                                                       Reserved, must be 0.
4530                                                     GFX11
4531                                                       Must be 0.
4532
4533                                                       If 1, wavefront execution terminates by trapping into the trap handler.
4534
4535                                                       CP is responsible for filling in the trap on end bit in
4536                                                       ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
4537     30:12   19 bits                                 Reserved, must be 0.
4538     31      1 bit   IMAGE_OP                        GFX10
4539                                                       Reserved, must be 0.
4540                                                     GFX11
4541                                                       If 1, the kernel execution contains image instructions. If executed as
4542                                                       part of a graphics pipeline, image read instructions will stall waiting
4543                                                       for any necessary ``WAIT_SYNC`` fence to be performed in order to
4544                                                       indicate that earlier pipeline stages have completed writing to the
4545                                                       image.
4546
4547                                                       Not used for compute kernels that are not part of a graphics pipeline and
4548                                                       must be 0.
4549     32      **Total size 4 bytes.**
4550     ======= ===================================================================================================================
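
The ``SHARED_VGPR_COUNT`` constraint and the ``INST_PREF_SIZE`` granularity
described above can be checked with a few lines of Python. This is only a
sketch: the function and parameter names are assumptions, and ``rsrc1_vgprs``
stands for the value the table refers to as ``compute_pgm_rsrc1.vgprs``.

.. code-block:: python

  def shared_vgpr_count_ok(rsrc1_vgprs, shared_vgpr_count, wavefront_size):
      """Check the SHARED_VGPR_COUNT rules stated in the table above."""
      if not 0 <= shared_vgpr_count <= 15:
          return False  # the field is 4 bits wide
      if wavefront_size == 32:
          return shared_vgpr_count == 0
      if wavefront_size != 64:
          raise ValueError("wavefront size must be 32 or 64")
      # Private VGPRs have a granularity of 4, shared VGPRs of 8.
      return (rsrc1_vgprs + 1) * 4 + shared_vgpr_count * 8 <= 256


  def inst_pref_bytes(inst_pref_size):
      """Bytes covered by an INST_PREF_SIZE value (granularity 128 bytes)."""
      if not 0 <= inst_pref_size <= 63:
          raise ValueError("INST_PREF_SIZE is a 6-bit field")
      return inst_pref_size * 128

For example, with ``rsrc1_vgprs`` equal to 33 (136 VGPRs) the maximum of 15
shared blocks (120 VGPRs) reaches exactly 256 and is accepted, while a value of
34 is rejected.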
4551
4552..
4553
4554  .. table:: Floating Point Rounding Mode Enumeration Values
4555     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4556
4557     ====================================== ===== ==============================
4558     Enumeration Name                       Value Description
4559     ====================================== ===== ==============================
4560     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4561     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4562     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4563     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4564     ====================================== ===== ==============================
4565
4566..
4567
4568  .. table:: Floating Point Denorm Mode Enumeration Values
4569     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4570
4571     ====================================== ===== ==============================
4572     Enumeration Name                       Value Description
4573     ====================================== ===== ==============================
4574     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
4575                                                  Denorms
4576     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4577     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4578     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4579     ====================================== ===== ==============================
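
The two enumeration tables above translate directly into enumerations. The
Python sketch below uses shortened, hypothetical member and helper names; only
the numeric values and their meanings are taken from the tables.

.. code-block:: python

  from enum import IntEnum


  class FloatRoundMode(IntEnum):
      NEAR_EVEN = 0       # Round Ties To Even
      PLUS_INFINITY = 1   # Round Toward +infinity
      MINUS_INFINITY = 2  # Round Toward -infinity
      ZERO = 3            # Round Toward 0


  class FloatDenormMode(IntEnum):
      FLUSH_SRC_DST = 0  # Flush Source and Destination Denorms
      FLUSH_DST = 1      # Flush Output Denorms
      FLUSH_SRC = 2      # Flush Source Denorms
      FLUSH_NONE = 3     # No Flush


  def flushes_source(mode):
      """True if source denorms are flushed in the given denorm mode."""
      return FloatDenormMode(mode) in (
          FloatDenormMode.FLUSH_SRC_DST,
          FloatDenormMode.FLUSH_SRC,
      )


  def flushes_destination(mode):
      """True if destination denorms are flushed in the given denorm mode."""
      return FloatDenormMode(mode) in (
          FloatDenormMode.FLUSH_SRC_DST,
          FloatDenormMode.FLUSH_DST,
      )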
4580
4581..
4582
4583  .. table:: System VGPR Work-Item ID Enumeration Values
4584     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4585
4586     ======================================== ===== ============================
4587     Enumeration Name                         Value Description
4588     ======================================== ===== ============================
4589     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4590                                                    ID.
4591     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4592                                                    dimensions ID.
4593     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4594                                                    dimensions ID.
4595     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4596     ======================================== ===== ============================
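
As a small illustration of the table above, the following Python sketch maps an
``ENABLE_VGPR_WORKITEM_ID`` value to the work-item ID dimensions that are set
up. The names ``WORKITEM_ID_DIMS`` and ``enabled_workitem_id_dims`` are
hypothetical.

.. code-block:: python

  # Enumeration value -> dimensions whose work-item ID is set up.
  WORKITEM_ID_DIMS = {
      0: ("X",),
      1: ("X", "Y"),
      2: ("X", "Y", "Z"),
  }


  def enabled_workitem_id_dims(value):
      """Return the dimensions enabled by a 2-bit enumeration value."""
      if value not in WORKITEM_ID_DIMS:
          # Value 3 is listed as undefined in the table.
          raise ValueError(f"undefined work-item ID enumeration value: {value}")
      return WORKITEM_ID_DIMS[value]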
4597
4598.. _amdgpu-amdhsa-initial-kernel-execution-state:
4599
4600Initial Kernel Execution State
4601~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4602
4603This section defines the register state that will be set up by the packet
4604processor prior to the start of execution of every wavefront. This is limited by
4605the constraints of the hardware controllers of CP/ADC/SPI.
4606
4607The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4609fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4610for enabled registers are dense starting at SGPR0: the first enabled register is
4611SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4612an SGPR number.
4613
The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4615all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4616using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4617actually initialized. These are then immediately followed by the System SGPRs
4618that are set up by ADC/SPI and can have different values for each wavefront of
4619the grid dispatch.
4620
4621SGPR register initial state is defined in
4622:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4623
4624  .. table:: SGPR Register Set Up Order
4625     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4626
4627     ========== ========================== ====== ==============================
4628     SGPR Order Name                       Number Description
4629                (kernel descriptor enable  of
4630                field)                     SGPRs
4631     ========== ========================== ====== ==============================
4632     First      Private Segment Buffer     4      See
4633                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4634                _segment_buffer)
4635     then       Dispatch Ptr               2      64-bit address of AQL dispatch
4636                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4637                                                  actually executing.
4638     then       Queue Ptr                  2      64-bit address of amd_queue_t
4639                (enable_sgpr_queue_ptr)           object for AQL queue on which
4640                                                  the dispatch packet was
4641                                                  queued.
4642     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4643                (enable_sgpr_kernarg              segment. This is directly
4644                _segment_ptr)                     copied from the
4645                                                  kernarg_address in the kernel
4646                                                  dispatch packet.
4647
4648                                                  Having CP load it once avoids
4649                                                  loading it at the beginning of
4650                                                  every wavefront.
4651     then       Dispatch Id                2      64-bit Dispatch ID of the
4652                (enable_sgpr_dispatch_id)         dispatch packet being
4653                                                  executed.
4654     then       Flat Scratch Init          2      See
4655                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4656                _init)
4657     then       Private Segment Size       1      The 32-bit byte size of a
4658                (enable_sgpr_private              single work-item's memory
4659                _segment_size)                    allocation. This is the
4660                                                  value from the kernel
4661                                                  dispatch packet Private
4662                                                  Segment Byte Size rounded up
4663                                                  by CP to a multiple of
4664                                                  DWORD.
4665
4666                                                  Having CP load it once avoids
4667                                                  loading it at the beginning of
4668                                                  every wavefront.
4669
4670                                                  This is not used for
4671                                                  GFX7-GFX8 since it is the same
4672                                                  value as the second SGPR of
4673                                                  Flat Scratch Init. However, it
4674                                                  may be needed for GFX9-GFX11 which
                                                  change the meaning of the
4676                                                  Flat Scratch Init value.
4677     then       Work-Group Id X            1      32-bit work-group id in X
4678                (enable_sgpr_workgroup_id         dimension of grid for
4679                _X)                               wavefront.
4680     then       Work-Group Id Y            1      32-bit work-group id in Y
4681                (enable_sgpr_workgroup_id         dimension of grid for
4682                _Y)                               wavefront.
4683     then       Work-Group Id Z            1      32-bit work-group id in Z
4684                (enable_sgpr_workgroup_id         dimension of grid for
4685                _Z)                               wavefront.
4686     then       Work-Group Info            1      {first_wavefront, 14'b0000,
4687                (enable_sgpr_workgroup            ordered_append_term[10:0],
4688                _info)                            threadgroup_size_in_wavefronts[5:0]}
4689     then       Scratch Wavefront Offset   1      See
4690                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4691                _segment_wavefront_offset)        and
4692                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4693     ========== ========================== ====== ==============================
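
As an illustration of the Work-Group Info layout in the table above, the
following sketch splits the 32-bit SGPR value into its fields. It assumes the
braces use Verilog-style concatenation with the most significant field listed
first, which places ``first_wavefront`` in bit 31; that reading and the helper
name ``decode_workgroup_info`` are assumptions made for this example.

.. code-block:: python

   def decode_workgroup_info(sgpr: int) -> dict:
       """Split a 32-bit Work-Group Info value into its fields.

       Assumed layout (most significant field first):
       {first_wavefront, 14'b0, ordered_append_term[10:0],
        threadgroup_size_in_wavefronts[5:0]}
       """
       sgpr &= 0xFFFFFFFF
       return {
           "threadgroup_size_in_wavefronts": sgpr & 0x3F,  # bits 5:0
           "ordered_append_term": (sgpr >> 6) & 0x7FF,     # bits 16:6
           "first_wavefront": (sgpr >> 31) & 0x1,          # bit 31
       }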
4694
The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1, etc.; disabled registers do not have a
VGPR number.
4701
4702There are different methods used for the VGPR initial state:
4703
4704* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4705  specifies otherwise, a separate VGPR register is used per work-item ID. The
4706  VGPR register initial state for this method is defined in
4707  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Packed work-item IDs*, the initial value of the VGPR0 register is
  used for all work-item IDs. The register layout for this method is defined in
  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4712
4713  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4714     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4715
4716     ========== ========================== ====== ==============================
4717     VGPR Order Name                       Number Description
4718                (kernel descriptor enable  of
4719                field)                     VGPRs
4720     ========== ========================== ====== ==============================
4721     First      Work-Item Id X             1      32-bit work-item id in X
4722                (Always initialized)              dimension of work-group for
4723                                                  wavefront lane.
4724     then       Work-Item Id Y             1      32-bit work-item id in Y
4725                (enable_vgpr_workitem_id          dimension of work-group for
4726                > 0)                              wavefront lane.
4727     then       Work-Item Id Z             1      32-bit work-item id in Z
4728                (enable_vgpr_workitem_id          dimension of work-group for
4729                > 1)                              wavefront lane.
4730     ========== ========================== ====== ==============================
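
The unpacked set up order can be sketched as a small helper that maps a value
of ``enable_vgpr_workitem_id`` to the VGPR numbers holding each work-item ID.
The function name is hypothetical and the sketch only restates the conditions
in the table above.

.. code-block:: python

   def unpacked_workitem_id_vgprs(enable_vgpr_workitem_id: int) -> dict:
       """Return {dimension: VGPR number} for the unpacked work-item ID method."""
       assigned = {"X": 0}  # Work-Item Id X is always initialized in VGPR0.
       if enable_vgpr_workitem_id > 0:
           assigned["Y"] = 1  # Y follows X densely.
       if enable_vgpr_workitem_id > 1:
           assigned["Z"] = 2  # Z follows Y densely.
       return assigned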
4731
4732..
4733
4734  .. table:: Register Layout for Packed Work-Item ID Method
4735     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4736
4737     ======= ======= ================ =========================================
4738     Bits    Size    Field Name       Description
4739     ======= ======= ================ =========================================
4740     0:9     10 bits Work-Item Id X   Work-item id in X
4741                                      dimension of work-group for
4742                                      wavefront lane.
4743
4744                                      Always initialized.
4745
4746     10:19   10 bits Work-Item Id Y   Work-item id in Y
4747                                      dimension of work-group for
4748                                      wavefront lane.
4749
4750                                      Initialized if enable_vgpr_workitem_id >
4751                                      0, otherwise set to 0.
4752     20:29   10 bits Work-Item Id Z   Work-item id in Z
4753                                      dimension of work-group for
4754                                      wavefront lane.
4755
4756                                      Initialized if enable_vgpr_workitem_id >
4757                                      1, otherwise set to 0.
4758     30:31   2 bits                   Reserved, set to 0.
4759     ======= ======= ================ =========================================
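
For the packed method, the three IDs can be recovered from VGPR0 with shifts
and a 10-bit mask. This is a minimal sketch of the bit layout in the table
above; ``unpack_workitem_ids`` is an illustrative name, not an existing API.

.. code-block:: python

   def unpack_workitem_ids(vgpr0: int) -> tuple:
       """Extract (x, y, z) work-item IDs from a packed 32-bit VGPR0 value."""
       vgpr0 &= 0xFFFFFFFF
       x = vgpr0 & 0x3FF          # bits 0:9
       y = (vgpr0 >> 10) & 0x3FF  # bits 10:19, 0 if not enabled
       z = (vgpr0 >> 20) & 0x3FF  # bits 20:29, 0 if not enabled
       return x, y, z

Bits 30:31 are reserved and are ignored by the sketch.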
4760
4761The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4762
1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC, which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
   its value cannot be included with the flat scratch init value, which is per
   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4. The VGPRs are set by SPI, which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
5. Flat Scratch register pair initialization is described in
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4774
4775The global segment can be accessed either using buffer instructions (GFX6 which
4776has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
4777instructions (GFX9-GFX11).
4778
4779If buffer operations are used, then the compiler can generate a V# with the
4780following properties:
4781
4782* base address of 0
4783* no swizzle
4784* ATC: 1 if IOMMU present (such as APU)
4785* ptr64: 1
4786* MTYPE set to support memory coherence that matches the runtime (such as CC for
4787  APU and NC for dGPU).
4788
4789.. _amdgpu-amdhsa-kernel-prolog:
4790
4791Kernel Prolog
4792~~~~~~~~~~~~~
4793
4794The compiler performs initialization in the kernel prologue depending on the
4795target and information about things like stack usage in the kernel and called
4796functions. Some of this initialization requires the compiler to request certain
4797User and System SGPRs be present in the
4798:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4799:ref:`amdgpu-amdhsa-kernel-descriptor`.
4800
4801.. _amdgpu-amdhsa-kernel-prolog-cfi:
4802
4803CFI
4804+++
4805
48061.  The CFI return address is undefined.
4807
48082.  The CFI CFA is defined using an expression which evaluates to a location
4809    description that comprises one memory location description for the
4810    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4811
4812.. _amdgpu-amdhsa-kernel-prolog-m0:
4813
4814M0
4815++
4816
4817GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. It is also possible to initialize M0 with
  the maximum possible LDS size for the given target (0x7FFF for GFX6 and
  0xFFFF for GFX7-GFX8).
4823GFX9-GFX11
4824  The M0 register is not used for range checking LDS accesses and so does not
4825  need to be initialized in the prolog.
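
The choice of M0 value described above can be summarized with a short sketch.
The function name and the ``gfx_major`` and ``total_lds_size`` parameters are
hypothetical; the constants are the per-target maximums quoted above.

.. code-block:: python

   def m0_lds_init_value(gfx_major: int, total_lds_size=None):
       """Return a value the prolog could place in M0, or None if not needed."""
       if gfx_major >= 9:
           return None  # M0 is not used for LDS range checking.
       if gfx_major < 6:
           raise ValueError("sketch only covers GFX6 and later")
       if total_lds_size is not None:
           return total_lds_size  # Any value >= the total LDS size is allowed.
       # Fall back to the maximum possible LDS size for the target.
       return 0x7FFF if gfx_major == 6 else 0xFFFF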
4826
4827.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4828
4829Stack Pointer
4830+++++++++++++
4831
4832If the kernel has function calls it must set up the ABI stack pointer described
4833in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4834SGPR32 to the unswizzled scratch offset of the address past the last local
4835allocation.
4836
4837.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4838
4839Frame Pointer
4840+++++++++++++
4841
4842If the kernel needs a frame pointer for the reasons defined in
4843``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4844kernel prolog. If a frame pointer is not required then all uses of the frame
4845pointer are replaced with immediate ``0`` offsets.
4846
4847.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4848
4849Flat Scratch
4850++++++++++++
4851
4852There are different methods used for initializing flat scratch:
4853
4854* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4855  specifies *Does not support generic address space*:
4856
4857  Flat scratch is not supported and there is no flat scratch register pair.
4858
4859* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4860  specifies *Offset flat scratch*:
4861
4862  If the kernel or any function it calls may use flat operations to access
4863  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4864  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4865  Scratch Wavefront Offset SGPR registers (see
4866  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4867
4868  1. The low word of Flat Scratch Init is the 32-bit byte offset from
4869     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4870     being managed by SPI for the queue executing the kernel dispatch. This is
4871     the same value used in the Scratch Segment Buffer V# base address.
4872
4873     CP obtains this from the runtime. (The Scratch Segment Buffer base address
4874     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4875
4876     The prolog must add the value of Scratch Wavefront Offset to get the
4877     wavefront's byte scratch backing memory offset from
4878     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4879
4880     The Scratch Wavefront Offset must also be used as an offset with Private
4881     segment address when using the Scratch Segment Buffer.
4882
     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.
4885
4886     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4887     SGPRn is the highest numbered SGPR allocated to the wavefront).
4888     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4889     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4890     FLAT SCRATCH BASE in flat memory instructions that access the scratch
4891     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
4894
4895     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4896     checks that the value in the kernel dispatch packet Private Segment Byte
4897     Size is not larger and requests the runtime to increase the queue's scratch
4898     size if necessary.
4899
4900     CP directly loads from the kernel dispatch packet Private Segment Byte Size
4901     field and rounds up to a multiple of DWORD. Having CP load it once avoids
4902     loading it at the beginning of every wavefront.
4903
4904     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4905     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4906     in flat memory instructions.
4907
4908* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4909  specifies *Absolute flat scratch*:
4910
4911  If the kernel or any function it calls may use flat operations to access
4912  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4913  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4914  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4915  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4916
4917  The Flat Scratch Init is the 64-bit address of the base of scratch backing
4918  memory being managed by SPI for the queue executing the kernel dispatch.
4919
4920  CP obtains this from the runtime.
4921
4922  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4923  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4924  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4925  memory instructions.
4926
4927  The Scratch Wavefront Offset must also be used as an offset with Private
4928  segment address when using the Scratch Segment Buffer (see
4929  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4930
4931* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4932  specifies *Architected flat scratch*:
4933
4934  If ENABLE_PRIVATE_SEGMENT is enabled in
4935  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH
4936  register pair will be initialized to the 64-bit address of the base of scratch
4937  backing memory being managed by SPI for the queue executing the kernel
4938  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4939  flat scratch base in flat memory instructions.
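
The arithmetic for the *Offset flat scratch* and *Absolute flat scratch*
methods above can be sketched as follows. This models only the values the
prolog computes, not the actual instruction sequence emitted by the backend;
the function and parameter names are illustrative, and the wrap-around to
32 or 64 bits is an assumption of the sketch.

.. code-block:: python

   MASK32 = 0xFFFFFFFF
   MASK64 = 0xFFFFFFFFFFFFFFFF

   def offset_flat_scratch(init_lo: int, init_hi: int, wave_offset: int):
       """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for Offset flat scratch.

       init_lo: low word of Flat Scratch Init (byte offset of queue scratch).
       init_hi: second word of Flat Scratch Init (per work-item byte size).
       wave_offset: Scratch Wavefront Offset in bytes.
       """
       wave_byte_offset = (init_lo + wave_offset) & MASK32
       flat_scratch_hi = wave_byte_offset >> 8  # units of 256 bytes
       flat_scratch_lo = init_hi & MASK32       # used as FLAT SCRATCH SIZE
       return flat_scratch_lo, flat_scratch_hi

   def absolute_flat_scratch(init_64: int, wave_offset: int):
       """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for Absolute flat scratch."""
       base = (init_64 + wave_offset) & MASK64  # 64-bit add with carry
       return base & MASK32, base >> 32

For example, with a queue offset of ``0x10000`` and a wavefront offset of
``0x300``, the offset method stores ``0x103`` in FLAT_SCRATCH_HI.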
4940
4941.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4942
4943Private Segment Buffer
4944++++++++++++++++++++++
4945
4946If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4947*Architected flat scratch* then a Private Segment Buffer is not supported.
4948Instead the flat SCRATCH instructions are used.
4949
Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by
the runtime. It is used, together with Scratch Wavefront Offset as an offset,
to access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4955
The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
always selected for the kernel as follows:
4958
4959  - If it is known during instruction selection that there is stack usage,
4960    SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
4961    optimizations are disabled (``-O0``), if stack objects already exist (for
4962    locals, etc.), or if there are any function calls.
4963
4964  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4965    are reserved for the tentative scratch V#. These will be used if it is
4966    determined that spilling is needed.
4967
4968    - If no use is made of the tentative scratch V#, then it is unreserved,
4969      and the register count is determined ignoring it.
4970    - If use is made of the tentative scratch V#, then its register numbers
4971      are shifted to the first four-aligned SGPR index after the highest one
4972      allocated by the register allocator, and all uses are updated. The
4973      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.
4977
4978    .. note::
4979
4980      This approach of using a tentative scratch V# and shifting the register
4981      numbers if used avoids having to perform register allocation a second
4982      time if the tentative V# is eliminated. This is more efficient and
4983      avoids the problem that the second register allocation may perform
4984      spilling which will fail as there is no longer a scratch V#.
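
As a rough illustration of the shifting rule above, the new location of a
used tentative scratch V# is the first four-aligned SGPR index after the
highest SGPR chosen by the register allocator. The helper name is
hypothetical and the sketch ignores the SGPR allocation bug case.

.. code-block:: python

   def shifted_scratch_vsharp_base(highest_allocated_sgpr: int) -> int:
       """First four-aligned SGPR index strictly after the highest allocated."""
       return (highest_allocated_sgpr // 4 + 1) * 4

   def register_count_with_vsharp(highest_allocated_sgpr: int) -> int:
       """SGPR count once the four V# registers sit at the shifted location."""
       return shifted_scratch_vsharp_base(highest_allocated_sgpr) + 4

If the highest allocated register is SGPR9, the V# moves to SGPR12-15 and the
register count becomes 16.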
4985
4986When the kernel prolog code is being emitted it is known whether the scratch V#
4987described above is actually used. If it is, the prolog code must set it up by
4988copying the Private Segment Buffer to the scratch V# registers and then adding
4989the Private Segment Wavefront Offset to the queue base address in the V#. The
4990result is a V# with a base address pointing to the beginning of the wavefront
4991scratch backing memory.
4992
4993The Private Segment Buffer is always requested, but the Private Segment
4994Wavefront Offset is only requested if it is used (see
4995:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4996
4997.. _amdgpu-amdhsa-memory-model:
4998
4999Memory Model
5000~~~~~~~~~~~~
5001
5002This section describes the mapping of the LLVM memory model onto AMDGPU machine
5003code (see :ref:`memmodel`).
5004
5005The AMDGPU backend supports the memory synchronization scopes specified in
5006:ref:`amdgpu-memory-scopes`.
5007
5008The code sequences used to implement the memory model specify the order of
5009instructions that a single thread must execute. The ``s_waitcnt`` and cache
5010management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5011to other memory instructions executed by the same thread. This allows them to be
5012moved earlier or later which can allow them to be combined with other instances
5013of the same instruction, or hoisted/sunk out of loops to improve performance.
5014Only the instructions related to the memory model are given; additional
5015``s_waitcnt`` instructions are required to ensure registers are defined before
5016being used. These may be able to be combined with the memory model ``s_waitcnt``
5017instructions as described above.
5018
5019The AMDGPU backend supports the following memory models:
5020
5021  HSA Memory Model [HSA]_
5022    The HSA memory model uses a single happens-before relation for all address
5023    spaces (see :ref:`amdgpu-address-spaces`).
5024  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).
5036
5037``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
5038operations.
5039
5040``buffer/global/flat_load/store/atomic`` instructions to global memory are
5041termed vector memory operations.
5042
5043Private address space uses ``buffer_load/store`` using the scratch V#
5044(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5045is accessing the memory, atomic memory orderings are not meaningful, and all
5046accesses are treated as non-atomic.
5047
Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores, atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.
5053
5054A memory synchronization scope wider than work-group is not meaningful for the
5055group (LDS) address space and is treated as work-group.
5056
5057The memory model does not support the region address space which is treated as
5058non-atomic.
5059
5060Acquire memory ordering is not meaningful on store atomic instructions and is
5061treated as non-atomic.
5062
Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
5065
5066Acquire-release memory ordering is not meaningful on load or store atomic
5067instructions and is treated as acquire and release respectively.
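
The ordering and scope adjustments listed above can be read as a small
normalization step applied before a code sequence is chosen. The sketch below
is only a restatement of those rules; the function names and string spellings
are assumptions, and only the ``agent`` and ``system`` scopes are treated as
wider than work-group.

.. code-block:: python

   def effective_ordering(instr: str, ordering: str) -> str:
       """Map an (instruction, ordering) pair to the ordering actually used."""
       if instr == "store atomic" and ordering == "acquire":
           return "non-atomic"
       if instr == "load atomic" and ordering == "release":
           return "non-atomic"
       if ordering == "acq_rel":
           if instr == "load atomic":
               return "acquire"
           if instr == "store atomic":
               return "release"
       return ordering

   def effective_scope(address_space: str, scope: str) -> str:
       """Clamp scopes wider than work-group for the local (LDS) space."""
       if address_space == "local" and scope in ("agent", "system"):
           return "workgroup"
       return scope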
5068
5069The memory order also adds the single thread optimization constraints defined in
5070table
5071:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5072
5073  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5074     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5075
5076     ============ ==============================================================
5077     LLVM Memory  Optimization Constraints
5078     Ordering
5079     ============ ==============================================================
5080     unordered    *none*
5081     monotonic    *none*
5082     acquire      - If a load atomic/atomicrmw then no following load/load
5083                    atomic/store/store atomic/atomicrmw/fence instruction can be
5084                    moved before the acquire.
5085                  - If a fence then same as load atomic, plus no preceding
5086                    associated fence-paired-atomic can be moved after the fence.
5087     release      - If a store atomic/atomicrmw then no preceding load/load
5088                    atomic/store/store atomic/atomicrmw/fence instruction can be
5089                    moved after the release.
5090                  - If a fence then same as store atomic, plus no following
5091                    associated fence-paired-atomic can be moved before the
5092                    fence.
5093     acq_rel      Same constraints as both acquire and release.
5094     seq_cst      - If a load atomic then same constraints as acquire, plus no
5095                    preceding sequentially consistent load atomic/store
5096                    atomic/atomicrmw/fence instruction can be moved after the
5097                    seq_cst.
5098                  - If a store atomic then the same constraints as release, plus
5099                    no following sequentially consistent load atomic/store
5100                    atomic/atomicrmw/fence instruction can be moved before the
5101                    seq_cst.
5102                  - If an atomicrmw/fence then same constraints as acq_rel.
5103     ============ ==============================================================
5104
5105The code sequences used to implement the memory model are defined in the
5106following sections:
5107
5108* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5109* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5110* :ref:`amdgpu-amdhsa-memory-model-gfx940`
5111* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5112
5113.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5114
5115Memory Model GFX6-GFX9
5116++++++++++++++++++++++
5117
5118For GFX6-GFX9:
5119
5120* Each agent has multiple shader arrays (SA).
5121* Each SA has multiple compute units (CU).
5122* Each CU has multiple SIMDs that execute wavefronts.
5123* The wavefronts for a single work-group are executed in the same CU but may be
5124  executed by different SIMDs.
5125* Each CU has a single LDS memory shared by the wavefronts of the work-groups
5126  executing on it.
5127* All LDS operations of a CU are performed as wavefront wide operations in a
5128  global order and involve no caching. Completion is reported to a wavefront in
5129  execution order.
5130* The LDS memory has multiple request queues shared by the SIMDs of a
5131  CU. Therefore, the LDS operations performed by different wavefronts of a
5132  work-group can be reordered relative to each other, which can result in
5133  reordering the visibility of vector memory operations with respect to LDS
5134  operations of other wavefronts in the same work-group. A ``s_waitcnt
5135  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5136  vector memory operations between wavefronts of a work-group, but not between
5137  operations performed by the same wavefront.
5138* The vector memory operations are performed as wavefront wide operations and
5139  completion is reported to a wavefront in execution order. The exception is
5140  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5141  vector memory order if they access LDS memory, and out of LDS operation order
5142  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
5149* The scalar memory operations access a scalar L1 cache shared by all wavefronts
5150  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5151  scalar operations are used in a restricted way so do not impact the memory
5152  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5153* The vector and scalar memory operations use an L2 cache shared by all CUs on
5154  the same agent.
5155* The L2 cache has independent channels to service disjoint ranges of virtual
5156  addresses.
5157* Each CU has a separate request queue per channel. Therefore, the vector and
5158  scalar memory operations performed by wavefronts executing in different
5159  work-groups (which may be executing on different CUs) of an agent can be
5160  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5161  ensure synchronization between vector memory operations of different CUs. It
5162  ensures a previous vector memory operation has completed before executing a
5163  subsequent vector memory or LDS operation and so can be used to meet the
5164  requirements of acquire and release.
5165* The L2 cache can be kept coherent with other agents on some targets, or ranges
5166  of virtual addresses can be set up to bypass it to ensure system coherence.
5167
5168Scalar memory operations are only used to access memory that is proven to not
5169change during the execution of the kernel dispatch. This includes constant
5170address space and global address space for program scope ``const`` variables.
5171Therefore, the kernel machine code does not have to maintain the scalar cache to
5172ensure it is coherent with the vector caches. The scalar and vector caches are
5173invalidated between kernel dispatches by CP since constant address space data
5174may change between kernel dispatch executions. See
5175:ref:`amdgpu-amdhsa-memory-spaces`.
5176
5177The one exception is if scalar writes are used to spill SGPR registers. In this
5178case the AMDGPU backend ensures the memory location used to spill is never
5179accessed by vector memory operations at the same time. If scalar writes are used
5180then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5181return since the locations may be used for vector memory instructions by a
5182future wavefront that uses the same scratch area, or a function call that
5183creates a frame at the same address, respectively. There is no need for a
5184``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5185
5186For kernarg backing memory:
5187
5188* CP invalidates the L1 cache at the start of each kernel dispatch.
5189* On dGPU the kernarg backing memory is allocated in host memory accessed as
5190  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5191  causes it to be treated as non-volatile and so is not invalidated by
5192  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
5195
5196Scratch backing memory (which is used for the private address space) is accessed
5197with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5198only accessed by a single thread, and is always write-before-read, there is
5199never a need to invalidate these entries from the L1 cache. Hence all cache
5200invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5201
5202The code sequences used to implement the memory model for GFX6-GFX9 are defined
5203in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5204
5205  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5206     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5207
5208     ============ ============ ============== ========== ================================
5209     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
5210                  Ordering     Sync Scope     Address    GFX6-GFX9
5211                                              Space
5212     ============ ============ ============== ========== ================================
5213     **Non-Atomic**
5214     ------------------------------------------------------------------------------------
5215     load         *none*       *none*         - global   - !volatile & !nontemporal
5216                                              - generic
5217                                              - private    1. buffer/global/flat_load
5218                                              - constant
5219                                                         - !volatile & nontemporal
5220
5221                                                           1. buffer/global/flat_load
5222                                                              glc=1 slc=1
5223
5224                                                         - volatile
5225
5226                                                           1. buffer/global/flat_load
5227                                                              glc=1
5228                                                           2. s_waitcnt vmcnt(0)
5229
5230                                                            - Must happen before
5231                                                              any following volatile
5232                                                              global/generic
5233                                                              load/store.
5234                                                            - Ensures that
5235                                                              volatile
5236                                                              operations to
5237                                                              different
5238                                                              addresses will not
5239                                                              be reordered by
5240                                                              hardware.
5241
5242     load         *none*       *none*         - local    1. ds_load
5243     store        *none*       *none*         - global   - !volatile & !nontemporal
5244                                              - generic
5245                                              - private    1. buffer/global/flat_store
5246                                              - constant
5247                                                         - !volatile & nontemporal
5248
5249                                                           1. buffer/global/flat_store
5250                                                              glc=1 slc=1
5251
5252                                                         - volatile
5253
5254                                                           1. buffer/global/flat_store
5255                                                           2. s_waitcnt vmcnt(0)
5256
5257                                                            - Must happen before
5258                                                              any following volatile
5259                                                              global/generic
5260                                                              load/store.
5261                                                            - Ensures that
5262                                                              volatile
5263                                                              operations to
5264                                                              different
5265                                                              addresses will not
5266                                                              be reordered by
5267                                                              hardware.
5268
5269     store        *none*       *none*         - local    1. ds_store
5270     **Unordered Atomic**
5271     ------------------------------------------------------------------------------------
5272     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
5273     store atomic unordered    *any*          *any*      *Same as non-atomic*.
5274     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
5275     **Monotonic Atomic**
5276     ------------------------------------------------------------------------------------
5277     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
5278                               - wavefront    - local
5279                               - workgroup    - generic
5280     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
5281                               - system       - generic     glc=1
5282     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
5283                               - wavefront    - generic
5284                               - workgroup
5285                               - agent
5286                               - system
5287     store atomic monotonic    - singlethread - local    1. ds_store
5288                               - wavefront
5289                               - workgroup
5290     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
5291                               - wavefront    - generic
5292                               - workgroup
5293                               - agent
5294                               - system
5295     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
5296                               - wavefront
5297                               - workgroup
5298     **Acquire Atomic**
5299     ------------------------------------------------------------------------------------
5300     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
5301                               - wavefront    - local
5302                                              - generic
5303     load atomic  acquire      - workgroup    - global   1. buffer/global_load
5304     load atomic  acquire      - workgroup    - local    1. ds/flat_load
5305                                              - generic  2. s_waitcnt lgkmcnt(0)
5306
5307                                                           - If OpenCL, omit.
5308                                                           - Must happen before
5309                                                             any following
5310                                                             global/generic
5311                                                             load/load
5312                                                             atomic/store/store
5313                                                             atomic/atomicrmw.
5314                                                           - Ensures any
5315                                                             following global
5316                                                             data read is no
5317                                                             older than a local load
5318                                                             atomic value being
5319                                                             acquired.
5320
5321     load atomic  acquire      - agent        - global   1. buffer/global_load
5322                               - system                     glc=1
5323                                                         2. s_waitcnt vmcnt(0)
5324
5325                                                           - Must happen before
5326                                                             following
5327                                                             buffer_wbinvl1_vol.
5328                                                           - Ensures the load
5329                                                             has completed
5330                                                             before invalidating
5331                                                             the cache.
5332
5333                                                         3. buffer_wbinvl1_vol
5334
5335                                                           - Must happen before
5336                                                             any following
5337                                                             global/generic
5338                                                             load/load
5339                                                             atomic/atomicrmw.
5340                                                           - Ensures that
5341                                                             following
5342                                                             loads will not see
5343                                                             stale global data.
5344
5345     load atomic  acquire      - agent        - generic  1. flat_load glc=1
5346                               - system                  2. s_waitcnt vmcnt(0) &
5347                                                            lgkmcnt(0)
5348
5349                                                           - If OpenCL, omit
5350                                                             lgkmcnt(0).
5351                                                           - Must happen before
5352                                                             following
5353                                                             buffer_wbinvl1_vol.
5354                                                           - Ensures the flat_load
5355                                                             has completed
5356                                                             before invalidating
5357                                                             the cache.
5358
5359                                                         3. buffer_wbinvl1_vol
5360
5361                                                           - Must happen before
5362                                                             any following
5363                                                             global/generic
5364                                                             load/load
5365                                                             atomic/atomicrmw.
5366                                                           - Ensures that
5367                                                             following loads
5368                                                             will not see stale
5369                                                             global data.
5370
5371     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5372                               - wavefront    - local
5373                                              - generic
5374     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5375     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5376                                              - generic  2. s_waitcnt lgkmcnt(0)
5377
5378                                                           - If OpenCL, omit.
5379                                                           - Must happen before
5380                                                             any following
5381                                                             global/generic
5382                                                             load/load
5383                                                             atomic/store/store
5384                                                             atomic/atomicrmw.
5385                                                           - Ensures any
5386                                                             following global
5387                                                             data read is no
5388                                                             older than a local
5389                                                             atomicrmw value
5390                                                             being acquired.
5391
5392     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5393                               - system                  2. s_waitcnt vmcnt(0)
5394
5395                                                           - Must happen before
5396                                                             following
5397                                                             buffer_wbinvl1_vol.
5398                                                           - Ensures the
5399                                                             atomicrmw has
5400                                                             completed before
5401                                                             invalidating the
5402                                                             cache.
5403
5404                                                         3. buffer_wbinvl1_vol
5405
5406                                                           - Must happen before
5407                                                             any following
5408                                                             global/generic
5409                                                             load/load
5410                                                             atomic/atomicrmw.
5411                                                           - Ensures that
5412                                                             following loads
5413                                                             will not see stale
5414                                                             global data.
5415
5416     atomicrmw    acquire      - agent        - generic  1. flat_atomic
5417                               - system                  2. s_waitcnt vmcnt(0) &
5418                                                            lgkmcnt(0)
5419
5420                                                           - If OpenCL, omit
5421                                                             lgkmcnt(0).
5422                                                           - Must happen before
5423                                                             following
5424                                                             buffer_wbinvl1_vol.
5425                                                           - Ensures the
5426                                                             atomicrmw has
5427                                                             completed before
5428                                                             invalidating the
5429                                                             cache.
5430
5431                                                         3. buffer_wbinvl1_vol
5432
5433                                                           - Must happen before
5434                                                             any following
5435                                                             global/generic
5436                                                             load/load
5437                                                             atomic/atomicrmw.
5438                                                           - Ensures that
5439                                                             following loads
5440                                                             will not see stale
5441                                                             global data.
5442
5443     fence        acquire      - singlethread *none*     *none*
5444                               - wavefront
5445     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5446
5447                                                           - If OpenCL and
5448                                                             address space is
5449                                                             not generic, omit.
5450                                                           - However, since LLVM
5451                                                             currently has no
5452                                                             address space on
5453                                                             the fence, it must
5454                                                             conservatively
5455                                                             always be generated. If
5456                                                             fence had an
5457                                                             address space then
5458                                                             set to address
5459                                                             space of OpenCL
5460                                                             fence flag, or to
5461                                                             generic if both
5462                                                             local and global
5463                                                             flags are
5464                                                             specified.
5465                                                           - Must happen after
5466                                                             any preceding
5467                                                             local/generic load
5468                                                             atomic/atomicrmw
5469                                                             with an equal or
5470                                                             wider sync scope
5471                                                             and memory ordering
5472                                                             stronger than
5473                                                             unordered (this is
5474                                                             termed the
5475                                                             fence-paired-atomic).
5476                                                           - Must happen before
5477                                                             any following
5478                                                             global/generic
5479                                                             load/load
5480                                                             atomic/store/store
5481                                                             atomic/atomicrmw.
5482                                                           - Ensures any
5483                                                             following global
5484                                                             data read is no
5485                                                             older than the
5486                                                             value read by the
5487                                                             fence-paired-atomic.
5488
5489     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5490                               - system                     vmcnt(0)
5491
5492                                                           - If OpenCL and
5493                                                             address space is
5494                                                             not generic, omit
5495                                                             lgkmcnt(0).
5496                                                           - However, since LLVM
5497                                                             currently has no
5498                                                             address space on
5499                                                             the fence, it must
5500                                                             conservatively
5501                                                             always be generated
5502                                                             (see comment for
5503                                                             previous fence).
5504                                                           - Could be split into
5505                                                             separate s_waitcnt
5506                                                             vmcnt(0) and
5507                                                             s_waitcnt
5508                                                             lgkmcnt(0) to allow
5509                                                             them to be
5510                                                             independently moved
5511                                                             according to the
5512                                                             following rules.
5513                                                           - s_waitcnt vmcnt(0)
5514                                                             must happen after
5515                                                             any preceding
5516                                                             global/generic load
5517                                                             atomic/atomicrmw
5518                                                             with an equal or
5519                                                             wider sync scope
5520                                                             and memory ordering
5521                                                             stronger than
5522                                                             unordered (this is
5523                                                             termed the
5524                                                             fence-paired-atomic).
5525                                                           - s_waitcnt lgkmcnt(0)
5526                                                             must happen after
5527                                                             any preceding
5528                                                             local/generic load
5529                                                             atomic/atomicrmw
5530                                                             with an equal or
5531                                                             wider sync scope
5532                                                             and memory ordering
5533                                                             stronger than
5534                                                             unordered (this is
5535                                                             termed the
5536                                                             fence-paired-atomic).
5537                                                           - Must happen before
5538                                                             the following
5539                                                             buffer_wbinvl1_vol.
5540                                                           - Ensures that the
5541                                                             fence-paired-atomic
5542                                                             has completed
5543                                                             before invalidating
5544                                                             the
5545                                                             cache. Therefore
5546                                                             any following
5547                                                             locations read must
5548                                                             be no older than
5549                                                             the value read by
5550                                                             the
5551                                                             fence-paired-atomic.
5552
5553                                                         2. buffer_wbinvl1_vol
5554
5555                                                           - Must happen before any
5556                                                             following global/generic
5557                                                             load/load
5558                                                             atomic/store/store
5559                                                             atomic/atomicrmw.
5560                                                           - Ensures that
5561                                                             following loads
5562                                                             will not see stale
5563                                                             global data.
5564
5565     **Release Atomic**
5566     ------------------------------------------------------------------------------------
5567     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5568                               - wavefront    - local
5569                                              - generic
5570     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5571                                              - generic
5572                                                           - If OpenCL, omit.
5573                                                           - Must happen after
5574                                                             any preceding
5575                                                             local/generic
5576                                                             load/store/load
5577                                                             atomic/store
5578                                                             atomic/atomicrmw.
5579                                                           - Must happen before
5580                                                             the following
5581                                                             store.
5582                                                           - Ensures that all
5583                                                             memory operations
5584                                                             to local have
5585                                                             completed before
5586                                                             performing the
5587                                                             store that is being
5588                                                             released.
5589
5590                                                         2. buffer/global/flat_store
5591     store atomic release      - workgroup    - local    1. ds_store
5592     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5593                               - system       - generic     vmcnt(0)
5594
5595                                                           - If OpenCL and
5596                                                             address space is
5597                                                             not generic, omit
5598                                                             lgkmcnt(0).
5599                                                           - Could be split into
5600                                                             separate s_waitcnt
5601                                                             vmcnt(0) and
5602                                                             s_waitcnt
5603                                                             lgkmcnt(0) to allow
5604                                                             them to be
5605                                                             independently moved
5606                                                             according to the
5607                                                             following rules.
5608                                                           - s_waitcnt vmcnt(0)
5609                                                             must happen after
5610                                                             any preceding
5611                                                             global/generic
5612                                                             load/store/load
5613                                                             atomic/store
5614                                                             atomic/atomicrmw.
5615                                                           - s_waitcnt lgkmcnt(0)
5616                                                             must happen after
5617                                                             any preceding
5618                                                             local/generic
5619                                                             load/store/load
5620                                                             atomic/store
5621                                                             atomic/atomicrmw.
5622                                                           - Must happen before
5623                                                             the following
5624                                                             store.
5625                                                           - Ensures that all
5626                                                             memory operations
5627                                                             to global and local
5628                                                             have completed before
5629                                                             performing the
5630                                                             store that is being
5631                                                             released.
5632
5633                                                         2. buffer/global/flat_store
5634     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5635                               - wavefront    - local
5636                                              - generic
5637     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5638                                              - generic
5639                                                           - If OpenCL, omit.
5640                                                           - Must happen after
5641                                                             any preceding
5642                                                             local/generic
5643                                                             load/store/load
5644                                                             atomic/store
5645                                                             atomic/atomicrmw.
5646                                                           - Must happen before
5647                                                             the following
5648                                                             atomicrmw.
5649                                                           - Ensures that all
5650                                                             memory operations
5651                                                             to local have
5652                                                             completed before
5653                                                             performing the
5654                                                             atomicrmw that is
5655                                                             being released.
5656
5657                                                         2. buffer/global/flat_atomic
5658     atomicrmw    release      - workgroup    - local    1. ds_atomic
5659     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5660                               - system       - generic     vmcnt(0)
5661
5662                                                           - If OpenCL, omit
5663                                                             lgkmcnt(0).
5664                                                           - Could be split into
5665                                                             separate s_waitcnt
5666                                                             vmcnt(0) and
5667                                                             s_waitcnt
5668                                                             lgkmcnt(0) to allow
5669                                                             them to be
5670                                                             independently moved
5671                                                             according to the
5672                                                             following rules.
5673                                                           - s_waitcnt vmcnt(0)
5674                                                             must happen after
5675                                                             any preceding
5676                                                             global/generic
5677                                                             load/store/load
5678                                                             atomic/store
5679                                                             atomic/atomicrmw.
5680                                                           - s_waitcnt lgkmcnt(0)
5681                                                             must happen after
5682                                                             any preceding
5683                                                             local/generic
5684                                                             load/store/load
5685                                                             atomic/store
5686                                                             atomic/atomicrmw.
5687                                                           - Must happen before
5688                                                             the following
5689                                                             atomicrmw.
5690                                                           - Ensures that all
5691                                                             memory operations
5692                                                             to global and local
5693                                                             have completed
5694                                                             before performing
5695                                                             the atomicrmw that
5696                                                             is being released.
5697
5698                                                         2. buffer/global/flat_atomic
5699     fence        release      - singlethread *none*     *none*
5700                               - wavefront
5701     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5702
5703                                                           - If OpenCL and
5704                                                             address space is
5705                                                             not generic, omit.
5706                                                           - However, since LLVM
5707                                                             currently has no
5708                                                             address space on
5709                                                             the fence, it must
5710                                                             conservatively
5711                                                             always be generated.
5712                                                             If the fence had an
5713                                                             address space, it
5714                                                             would be set to the
5715                                                             address space of the
5716                                                             OpenCL fence flag, or
5717                                                             generic if both
5718                                                             local and global
5719                                                             flags are
5720                                                             specified.
5721                                                           - Must happen after
5722                                                             any preceding
5723                                                             local/generic
5724                                                             load/load
5725                                                             atomic/store/store
5726                                                             atomic/atomicrmw.
5727                                                           - Must happen before
5728                                                             any following store
5729                                                             atomic/atomicrmw
5730                                                             with an equal or
5731                                                             wider sync scope
5732                                                             and memory ordering
5733                                                             stronger than
5734                                                             unordered (this is
5735                                                             termed the
5736                                                             fence-paired-atomic).
5737                                                           - Ensures that all
5738                                                             memory operations
5739                                                             to local have
5740                                                             completed before
5741                                                             performing the
5742                                                             following
5743                                                             fence-paired-atomic.
5744
5745     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5746                               - system                     vmcnt(0)
5747
5748                                                           - If OpenCL and
5749                                                             address space is
5750                                                             not generic, omit
5751                                                             lgkmcnt(0).
5752                                                           - If OpenCL and
5753                                                             address space is
5754                                                             local, omit
5755                                                             vmcnt(0).
5756                                                           - However, since LLVM
5757                                                             currently has no
5758                                                             address space on
5759                                                             the fence, it must
5760                                                             conservatively
5761                                                             always be generated.
5762                                                             If the fence had an
5763                                                             address space, it
5764                                                             would be set to the
5765                                                             address space of the
5766                                                             OpenCL fence flag, or
5767                                                             generic if both
5768                                                             local and global
5769                                                             flags are
5770                                                             specified.
5771                                                           - Could be split into
5772                                                             separate s_waitcnt
5773                                                             vmcnt(0) and
5774                                                             s_waitcnt
5775                                                             lgkmcnt(0) to allow
5776                                                             them to be
5777                                                             independently moved
5778                                                             according to the
5779                                                             following rules.
5780                                                           - s_waitcnt vmcnt(0)
5781                                                             must happen after
5782                                                             any preceding
5783                                                             global/generic
5784                                                             load/store/load
5785                                                             atomic/store
5786                                                             atomic/atomicrmw.
5787                                                           - s_waitcnt lgkmcnt(0)
5788                                                             must happen after
5789                                                             any preceding
5790                                                             local/generic
5791                                                             load/store/load
5792                                                             atomic/store
5793                                                             atomic/atomicrmw.
5794                                                           - Must happen before
5795                                                             any following store
5796                                                             atomic/atomicrmw
5797                                                             with an equal or
5798                                                             wider sync scope
5799                                                             and memory ordering
5800                                                             stronger than
5801                                                             unordered (this is
5802                                                             termed the
5803                                                             fence-paired-atomic).
5804                                                           - Ensures that all
5805                                                             memory operations
5806                                                             have
5807                                                             completed before
5808                                                             performing the
5809                                                             following
5810                                                             fence-paired-atomic.
5811
5812     **Acquire-Release Atomic**
5813     ------------------------------------------------------------------------------------
5814     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5815                               - wavefront    - local
5816                                              - generic
5817     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5818
5819                                                           - If OpenCL, omit.
5820                                                           - Must happen after
5821                                                             any preceding
5822                                                             local/generic
5823                                                             load/store/load
5824                                                             atomic/store
5825                                                             atomic/atomicrmw.
5826                                                           - Must happen before
5827                                                             the following
5828                                                             atomicrmw.
5829                                                           - Ensures that all
5830                                                             memory operations
5831                                                             to local have
5832                                                             completed before
5833                                                             performing the
5834                                                             atomicrmw that is
5835                                                             being released.
5836
5837                                                         2. buffer/global_atomic
5838
5839     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5840                                                         2. s_waitcnt lgkmcnt(0)
5841
5842                                                           - If OpenCL, omit.
5843                                                           - Must happen before
5844                                                             any following
5845                                                             global/generic
5846                                                             load/load
5847                                                             atomic/store/store
5848                                                             atomic/atomicrmw.
5849                                                           - Ensures any
5850                                                             following global
5851                                                             data read is no
5852                                                             older than the local
5853                                                             atomicrmw value being
5854                                                             acquired.
5855
5856     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5857
5858                                                           - If OpenCL, omit.
5859                                                           - Must happen after
5860                                                             any preceding
5861                                                             local/generic
5862                                                             load/store/load
5863                                                             atomic/store
5864                                                             atomic/atomicrmw.
5865                                                           - Must happen before
5866                                                             the following
5867                                                             atomicrmw.
5868                                                           - Ensures that all
5869                                                             memory operations
5870                                                             to local have
5871                                                             completed before
5872                                                             performing the
5873                                                             atomicrmw that is
5874                                                             being released.
5875
5876                                                         2. flat_atomic
5877                                                         3. s_waitcnt lgkmcnt(0)
5878
5879                                                           - If OpenCL, omit.
5880                                                           - Must happen before
5881                                                             any following
5882                                                             global/generic
5883                                                             load/load
5884                                                             atomic/store/store
5885                                                             atomic/atomicrmw.
5886                                                           - Ensures any
5887                                                             following global
5888                                                             data read is no
5889                                                             older than a local
5890                                                             atomicrmw value being
5891                                                             acquired.
5892
5893     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5894                               - system                     vmcnt(0)
5895
5896                                                           - If OpenCL, omit
5897                                                             lgkmcnt(0).
5898                                                           - Could be split into
5899                                                             separate s_waitcnt
5900                                                             vmcnt(0) and
5901                                                             s_waitcnt
5902                                                             lgkmcnt(0) to allow
5903                                                             them to be
5904                                                             independently moved
5905                                                             according to the
5906                                                             following rules.
5907                                                           - s_waitcnt vmcnt(0)
5908                                                             must happen after
5909                                                             any preceding
5910                                                             global/generic
5911                                                             load/store/load
5912                                                             atomic/store
5913                                                             atomic/atomicrmw.
5914                                                           - s_waitcnt lgkmcnt(0)
5915                                                             must happen after
5916                                                             any preceding
5917                                                             local/generic
5918                                                             load/store/load
5919                                                             atomic/store
5920                                                             atomic/atomicrmw.
5921                                                           - Must happen before
5922                                                             the following
5923                                                             atomicrmw.
5924                                                           - Ensures that all
5925                                                             memory operations
5926                                                             to global have
5927                                                             completed before
5928                                                             performing the
5929                                                             atomicrmw that is
5930                                                             being released.
5931
5932                                                         2. buffer/global_atomic
5933                                                         3. s_waitcnt vmcnt(0)
5934
5935                                                           - Must happen before
5936                                                             the following
5937                                                             buffer_wbinvl1_vol.
5938                                                           - Ensures the
5939                                                             atomicrmw has
5940                                                             completed before
5941                                                             invalidating the
5942                                                             cache.
5943
5944                                                         4. buffer_wbinvl1_vol
5945
5946                                                           - Must happen before
5947                                                             any following
5948                                                             global/generic
5949                                                             load/load
5950                                                             atomic/atomicrmw.
5951                                                           - Ensures that
5952                                                             following loads
5953                                                             will not see stale
5954                                                             global data.
5955
5956     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
5957                               - system                     vmcnt(0)
5958
5959                                                           - If OpenCL, omit
5960                                                             lgkmcnt(0).
5961                                                           - Could be split into
5962                                                             separate s_waitcnt
5963                                                             vmcnt(0) and
5964                                                             s_waitcnt
5965                                                             lgkmcnt(0) to allow
5966                                                             them to be
5967                                                             independently moved
5968                                                             according to the
5969                                                             following rules.
5970                                                           - s_waitcnt vmcnt(0)
5971                                                             must happen after
5972                                                             any preceding
5973                                                             global/generic
5974                                                             load/store/load
5975                                                             atomic/store
5976                                                             atomic/atomicrmw.
5977                                                           - s_waitcnt lgkmcnt(0)
5978                                                             must happen after
5979                                                             any preceding
5980                                                             local/generic
5981                                                             load/store/load
5982                                                             atomic/store
5983                                                             atomic/atomicrmw.
5984                                                           - Must happen before
5985                                                             the following
5986                                                             atomicrmw.
5987                                                           - Ensures that all
5988                                                             memory operations
5989                                                             to global have
5990                                                             completed before
5991                                                             performing the
5992                                                             atomicrmw that is
5993                                                             being released.
5994
5995                                                         2. flat_atomic
5996                                                         3. s_waitcnt vmcnt(0) &
5997                                                            lgkmcnt(0)
5998
5999                                                           - If OpenCL, omit
6000                                                             lgkmcnt(0).
6001                                                           - Must happen before
6002                                                             the following
6003                                                             buffer_wbinvl1_vol.
6004                                                           - Ensures the
6005                                                             atomicrmw has
6006                                                             completed before
6007                                                             invalidating the
6008                                                             cache.
6009
6010                                                         4. buffer_wbinvl1_vol
6011
6012                                                           - Must happen before
6013                                                             any following
6014                                                             global/generic
6015                                                             load/load
6016                                                             atomic/atomicrmw.
6017                                                           - Ensures that
6018                                                             following loads
6019                                                             will not see stale
6020                                                             global data.
6021
6022     fence        acq_rel      - singlethread *none*     *none*
6023                               - wavefront
6024     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
6025
6026                                                           - If OpenCL and
6027                                                             address space is
6028                                                             not generic, omit.
6029                                                           - However,
6030                                                             since LLVM
6031                                                             currently has no
6032                                                             address space on
6033                                                             the fence, it must
6034                                                             conservatively
6035                                                             always be generated
6036                                                             (see comment for
6037                                                             previous fence).
6038                                                           - Must happen after
6039                                                             any preceding
6040                                                             local/generic
6041                                                             load/load
6042                                                             atomic/store/store
6043                                                             atomic/atomicrmw.
6044                                                           - Must happen before
6045                                                             any following
6046                                                             global/generic
6047                                                             load/load
6048                                                             atomic/store/store
6049                                                             atomic/atomicrmw.
6050                                                           - Ensures that all
6051                                                             memory operations
6052                                                             to local have
6053                                                             completed before
6054                                                             performing any
6055                                                             following global
6056                                                             memory operations.
6057                                                           - Ensures that the
6058                                                             preceding
6059                                                             local/generic load
6060                                                             atomic/atomicrmw
6061                                                             with an equal or
6062                                                             wider sync scope
6063                                                             and memory ordering
6064                                                             stronger than
6065                                                             unordered (this is
6066                                                             termed the
6067                                                             acquire-fence-paired-atomic)
6068                                                             has completed
6069                                                             before following
6070                                                             global memory
6071                                                             operations. This
6072                                                             satisfies the
6073                                                             requirements of
6074                                                             acquire.
6075                                                           - Ensures that all
6076                                                             previous memory
6077                                                             operations have
6078                                                             completed before a
6079                                                             following
6080                                                             local/generic store
6081                                                             atomic/atomicrmw
6082                                                             with an equal or
6083                                                             wider sync scope
6084                                                             and memory ordering
6085                                                             stronger than
6086                                                             unordered (this is
6087                                                             termed the
6088                                                             release-fence-paired-atomic).
6089                                                             This satisfies the
6090                                                             requirements of
6091                                                             release.
6092
6093     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6094                               - system                     vmcnt(0)
6095
6096                                                           - If OpenCL and
6097                                                             address space is
6098                                                             not generic, omit
6099                                                             lgkmcnt(0).
6100                                                           - However, since LLVM
6101                                                             currently has no
6102                                                             address space on
6103                                                             the fence, it must
6104                                                             conservatively
6105                                                             always be generated
6106                                                             (see comment for
6107                                                             previous fence).
6108                                                           - Could be split into
6109                                                             separate s_waitcnt
6110                                                             vmcnt(0) and
6111                                                             s_waitcnt
6112                                                             lgkmcnt(0) to allow
6113                                                             them to be
6114                                                             independently moved
6115                                                             according to the
6116                                                             following rules.
6117                                                           - s_waitcnt vmcnt(0)
6118                                                             must happen after
6119                                                             any preceding
6120                                                             global/generic
6121                                                             load/store/load
6122                                                             atomic/store
6123                                                             atomic/atomicrmw.
6124                                                           - s_waitcnt lgkmcnt(0)
6125                                                             must happen after
6126                                                             any preceding
6127                                                             local/generic
6128                                                             load/store/load
6129                                                             atomic/store
6130                                                             atomic/atomicrmw.
6131                                                           - Must happen before
6132                                                             the following
6133                                                             buffer_wbinvl1_vol.
6134                                                           - Ensures that the
6135                                                             preceding
6136                                                             global/local/generic
6137                                                             load
6138                                                             atomic/atomicrmw
6139                                                             with an equal or
6140                                                             wider sync scope
6141                                                             and memory ordering
6142                                                             stronger than
6143                                                             unordered (this is
6144                                                             termed the
6145                                                             acquire-fence-paired-atomic)
6146                                                             has completed
6147                                                             before invalidating
6148                                                             the cache. This
6149                                                             satisfies the
6150                                                             requirements of
6151                                                             acquire.
6152                                                           - Ensures that all
6153                                                             previous memory
6154                                                             operations have
6155                                                             completed before a
6156                                                             following
6157                                                             global/local/generic
6158                                                             store
6159                                                             atomic/atomicrmw
6160                                                             with an equal or
6161                                                             wider sync scope
6162                                                             and memory ordering
6163                                                             stronger than
6164                                                             unordered (this is
6165                                                             termed the
6166                                                             release-fence-paired-atomic).
6167                                                             This satisfies the
6168                                                             requirements of
6169                                                             release.
6170
6171                                                         2. buffer_wbinvl1_vol
6172
6173                                                           - Must happen before
6174                                                             any following
6175                                                             global/generic
6176                                                             load/load
6177                                                             atomic/store/store
6178                                                             atomic/atomicrmw.
6179                                                           - Ensures that
6180                                                             following loads
6181                                                             will not see stale
6182                                                             global data. This
6183                                                             satisfies the
6184                                                             requirements of
6185                                                             acquire.
6186
6187     **Sequential Consistent Atomic**
6188     ------------------------------------------------------------------------------------
6189     load atomic  seq_cst      - singlethread - global   *Same as corresponding
6190                               - wavefront    - local    load atomic acquire,
6191                                              - generic  except must generate
6192                                                         all instructions even
6193                                                         for OpenCL.*
6194     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6195                                              - generic
6196
6197                                                           - Must
6198                                                             happen after
6199                                                             preceding
6200                                                             local/generic load
6201                                                             atomic/store
6202                                                             atomic/atomicrmw
6203                                                             with memory
6204                                                             ordering of seq_cst
6205                                                             and with equal or
6206                                                             wider sync scope.
6207                                                             (Note that seq_cst
6208                                                             fences have their
6209                                                             own s_waitcnt
6210                                                             lgkmcnt(0) and so do
6211                                                             not need to be
6212                                                             considered.)
6213                                                           - Ensures any
6214                                                             preceding
                                                             sequentially
6216                                                             consistent local
6217                                                             memory instructions
6218                                                             have completed
6219                                                             before executing
6220                                                             this sequentially
6221                                                             consistent
6222                                                             instruction. This
6223                                                             prevents reordering
6224                                                             a seq_cst store
6225                                                             followed by a
6226                                                             seq_cst load. (Note
6227                                                             that seq_cst is
6228                                                             stronger than
6229                                                             acquire/release as
6230                                                             the reordering of
6231                                                             load acquire
6232                                                             followed by a store
6233                                                             release is
6234                                                             prevented by the
6235                                                             s_waitcnt of
6236                                                             the release, but
6237                                                             there is nothing
6238                                                             preventing a store
6239                                                             release followed by
6240                                                             load acquire from
6241                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We choose the
                                                             load to make the
                                                             s_waitcnt be as late
                                                             as possible so that
                                                             the store may have
                                                             already completed.)
6252
6253                                                         2. *Following
6254                                                            instructions same as
6255                                                            corresponding load
6256                                                            atomic acquire,
6257                                                            except must generate
6258                                                            all instructions even
6259                                                            for OpenCL.*
6260     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6261                                                         load atomic acquire,
6262                                                         except must generate
6263                                                         all instructions even
6264                                                         for OpenCL.*
6265
6266     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6267                               - system       - generic     vmcnt(0)
6268
6269                                                           - Could be split into
6270                                                             separate s_waitcnt
6271                                                             vmcnt(0)
6272                                                             and s_waitcnt
6273                                                             lgkmcnt(0) to allow
6274                                                             them to be
6275                                                             independently moved
6276                                                             according to the
6277                                                             following rules.
6278                                                           - s_waitcnt lgkmcnt(0)
6279                                                             must happen after
6280                                                             preceding
6281                                                             global/generic load
6282                                                             atomic/store
6283                                                             atomic/atomicrmw
6284                                                             with memory
6285                                                             ordering of seq_cst
6286                                                             and with equal or
6287                                                             wider sync scope.
6288                                                             (Note that seq_cst
6289                                                             fences have their
6290                                                             own s_waitcnt
6291                                                             lgkmcnt(0) and so do
6292                                                             not need to be
6293                                                             considered.)
6294                                                           - s_waitcnt vmcnt(0)
6295                                                             must happen after
6296                                                             preceding
6297                                                             global/generic load
6298                                                             atomic/store
6299                                                             atomic/atomicrmw
6300                                                             with memory
6301                                                             ordering of seq_cst
6302                                                             and with equal or
6303                                                             wider sync scope.
6304                                                             (Note that seq_cst
6305                                                             fences have their
6306                                                             own s_waitcnt
6307                                                             vmcnt(0) and so do
6308                                                             not need to be
6309                                                             considered.)
6310                                                           - Ensures any
6311                                                             preceding
                                                             sequentially
6313                                                             consistent global
6314                                                             memory instructions
6315                                                             have completed
6316                                                             before executing
6317                                                             this sequentially
6318                                                             consistent
6319                                                             instruction. This
6320                                                             prevents reordering
6321                                                             a seq_cst store
6322                                                             followed by a
6323                                                             seq_cst load. (Note
6324                                                             that seq_cst is
6325                                                             stronger than
6326                                                             acquire/release as
6327                                                             the reordering of
6328                                                             load acquire
6329                                                             followed by a store
6330                                                             release is
6331                                                             prevented by the
6332                                                             s_waitcnt of
6333                                                             the release, but
6334                                                             there is nothing
6335                                                             preventing a store
6336                                                             release followed by
6337                                                             load acquire from
6338                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst
                                                             load. We choose the
                                                             load to make the
                                                             s_waitcnt be as late
                                                             as possible so that
                                                             the store may have
                                                             already completed.)
6349
6350                                                         2. *Following
6351                                                            instructions same as
6352                                                            corresponding load
6353                                                            atomic acquire,
6354                                                            except must generate
6355                                                            all instructions even
6356                                                            for OpenCL.*
6357     store atomic seq_cst      - singlethread - global   *Same as corresponding
6358                               - wavefront    - local    store atomic release,
6359                               - workgroup    - generic  except must generate
6360                               - agent                   all instructions even
6361                               - system                  for OpenCL.*
6362     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6363                               - wavefront    - local    atomicrmw acq_rel,
6364                               - workgroup    - generic  except must generate
6365                               - agent                   all instructions even
6366                               - system                  for OpenCL.*
6367     fence        seq_cst      - singlethread *none*     *Same as corresponding
6368                               - wavefront               fence acq_rel,
6369                               - workgroup               except must generate
6370                               - agent                   all instructions even
6371                               - system                  for OpenCL.*
6372     ============ ============ ============== ========== ================================
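
As a rough illustration of the ``load atomic seq_cst`` rows in the table above,
the following Python sketch maps a sync scope and address space to the counters
that the leading ``s_waitcnt`` must cover. The function names and the
tuple-of-counters return value are hypothetical and are not part of LLVM; the
mapping is only a transcription of the table, and the instructions that follow
the wait are the same as for the corresponding load atomic acquire.

```python
# Hypothetical helpers, not part of LLVM: they transcribe the leading
# s_waitcnt listed for "load atomic seq_cst" in the GFX6-GFX9 table.

def seq_cst_load_leading_wait(scope, addrspace):
    """Return the counters the leading s_waitcnt must wait on (may be empty)."""
    if scope in ("singlethread", "wavefront"):
        if addrspace in ("global", "local", "generic"):
            return ()  # same as the corresponding load atomic acquire
    elif scope == "workgroup":
        if addrspace in ("global", "generic"):
            return ("lgkmcnt",)
        if addrspace == "local":
            return ()  # same as the corresponding load atomic acquire
    elif scope in ("agent", "system"):
        if addrspace in ("global", "generic"):
            return ("lgkmcnt", "vmcnt")
    # Combinations that the table does not list are rejected.
    raise ValueError(f"combination not listed: {scope}/{addrspace}")


def format_wait(counters):
    """Render the counters in the table's notation, or None if no wait."""
    if not counters:
        return None
    return "s_waitcnt " + " & ".join(f"{c}(0)" for c in counters)
```

For example, ``format_wait(seq_cst_load_leading_wait("agent", "global"))``
yields the combined wait shown in the agent and system row, which may be split
into separate waits as the table notes.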
6373
6374.. _amdgpu-amdhsa-memory-model-gfx90a:
6375
6376Memory Model GFX90A
6377+++++++++++++++++++
6378
6379For GFX90A:
6380
6381* Each agent has multiple shader arrays (SA).
6382* Each SA has multiple compute units (CU).
6383* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in
  which the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated because wavefronts of the same work-group can be in different CUs.
6390* All LDS operations of a CU are performed as wavefront wide operations in a
6391  global order and involve no caching. Completion is reported to a wavefront in
6392  execution order.
6393* The LDS memory has multiple request queues shared by the SIMDs of a
6394  CU. Therefore, the LDS operations performed by different wavefronts of a
6395  work-group can be reordered relative to each other, which can result in
6396  reordering the visibility of vector memory operations with respect to LDS
6397  operations of other wavefronts in the same work-group. A ``s_waitcnt
6398  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6399  vector memory operations between wavefronts of a work-group, but not between
6400  operations performed by the same wavefront.
6401* The vector memory operations are performed as wavefront wide operations and
6402  completion is reported to a wavefront in execution order. The exception is
6403  that ``flat_load/store/atomic`` instructions can report out of vector memory
6404  order if they access LDS memory, and out of LDS operation order if they access
6405  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:
6408
6409  * No special action is required for coherence between the lanes of a single
6410    wavefront.
6411
  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is tgsplit
    execution mode, in which wavefronts of the same work-group can be in
    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
    the following item.
6417
6418  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6419    executing in different work-groups as they may be executing on different
6420    CUs.
6421
6422* The scalar memory operations access a scalar L1 cache shared by all wavefronts
6423  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6424  scalar operations are used in a restricted way so do not impact the memory
6425  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6426* The vector and scalar memory operations use an L2 cache shared by all CUs on
6427  the same agent.
6428
6429  * The L2 cache has independent channels to service disjoint ranges of virtual
6430    addresses.
6431  * Each CU has a separate request queue per channel. Therefore, the vector and
6432    scalar memory operations performed by wavefronts executing in different
6433    work-groups (which may be executing on different CUs), or the same
6434    work-group if executing in tgsplit mode, of an agent can be reordered
6435    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6436    synchronization between vector memory operations of different CUs. It
6437    ensures a previous vector memory operation has completed before executing a
6438    subsequent vector memory or LDS operation and so can be used to meet the
6439    requirements of acquire and release.
6440  * The L2 cache of one agent can be kept coherent with other agents by:
6441    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6442    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6443    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6444
6445    * Any local memory cache lines will be automatically invalidated by writes
6446      from CUs associated with other L2 caches, or writes from the CPU, due to
6447      the cache probe caused by coherent requests. Coherent requests are caused
6448      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6449      XGMI, and by PCIe requests that are configured to be coherent requests.
6450    * XGMI accesses from the CPU to local memory may be cached on the CPU.
6451      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6453    * Since all work-groups on the same agent share the same L2, no L2
6454      invalidation or writeback is required for coherence.
6455    * To ensure coherence of local and remote memory writes of work-groups in
6456      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6457      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
      (used for remote coarse grain memory). Note that MTYPE CC (used for local
6459      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6460      remote fine grain memory) bypasses the L2, so both will never result in
6461      dirty L2 cache lines.
6462    * To ensure coherence of local and remote memory reads of work-groups in
6463      different agents a ``buffer_invl2`` is required. It will invalidate L2
6464      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6465      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
      coarse grain memory) cause local reads to be invalidated by remote writes
      with the PTE C-bit so these cache lines are not invalidated. Note that
6468      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6469      never result in L2 cache lines that need to be invalidated.
6470
6471  * PCIe access from the GPU to the CPU memory is kept coherent by using the
6472    MTYPE UC (uncached) which bypasses the L2.
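
The cache rules in the list above can be condensed into a small decision
sketch. The Python below is illustrative only: the function names, the boolean
parameters, and the use of plain strings for MTYPEs are hypothetical
conveniences, and the logic merely restates the bullets.

```python
# Illustrative only; all names are hypothetical and restate the GFX90A bullets.

def needs_l1_invalidate(same_wavefront, same_workgroup, tgsplit):
    """True if buffer_wbinvl1_vol is needed for vector L1 coherence."""
    if same_wavefront:
        return False  # lanes of a single wavefront need no special action
    if same_workgroup and not tgsplit:
        return False  # the whole work-group executes on one CU
    return True       # the wavefronts may be executing on different CUs


# MTYPEs whose dirty L2 lines are written back by buffer_wbl2.
WRITTEN_BACK_BY_WBL2 = {"RW", "NC"}
# MTYPEs whose L2 lines are invalidated by buffer_invl2.
INVALIDATED_BY_INVL2 = {"NC"}
KNOWN_MTYPES = {"RW", "NC", "CC", "UC"}


def l2_ops_for(mtype):
    """List the L2 maintenance instructions that act on lines of this MTYPE."""
    if mtype not in KNOWN_MTYPES:
        raise ValueError(f"unknown MTYPE: {mtype}")
    ops = []
    if mtype in WRITTEN_BACK_BY_WBL2:
        ops.append("buffer_wbl2")
    if mtype in INVALIDATED_BY_INVL2:
        ops.append("buffer_invl2")
    return ops  # empty for CC (write through) and UC (bypasses the L2)
```

Note that only MTYPE NC is affected by both instructions, which matches its use
for remote coarse grain memory.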
6473
6474Scalar memory operations are only used to access memory that is proven to not
6475change during the execution of the kernel dispatch. This includes constant
6476address space and global address space for program scope ``const`` variables.
6477Therefore, the kernel machine code does not have to maintain the scalar cache to
6478ensure it is coherent with the vector caches. The scalar and vector caches are
6479invalidated between kernel dispatches by CP since constant address space data
6480may change between kernel dispatch executions. See
6481:ref:`amdgpu-amdhsa-memory-spaces`.
6482
6483The one exception is if scalar writes are used to spill SGPR registers. In this
6484case the AMDGPU backend ensures the memory location used to spill is never
6485accessed by vector memory operations at the same time. If scalar writes are used
6486then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6487return since the locations may be used for vector memory instructions by a
6488future wavefront that uses the same scratch area, or a function call that
6489creates a frame at the same address, respectively. There is no need for a
6490``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
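
A minimal sketch of the placement rule just described, written in Python over a
list of instruction strings. It assumes, purely for illustration, that a
function return appears as ``s_setpc_b64`` in the instruction stream; that
choice, the helper name, and the flag parameter are assumptions of this example
and do not describe the backend's actual implementation.

```python
# Hypothetical model of where s_dcache_wb is placed; not the backend's code.
# Assumption: in this toy stream a function return is spelled "s_setpc_b64".
EXIT_MNEMONICS = ("s_endpgm", "s_setpc_b64")


def insert_scalar_writebacks(instrs, uses_scalar_spill_writes):
    """Return a copy of instrs with s_dcache_wb before each exit point."""
    if not uses_scalar_spill_writes:
        return list(instrs)  # no scalar writes, so nothing to write back
    out = []
    for ins in instrs:
        parts = ins.split()
        if parts and parts[0] in EXIT_MNEMONICS:
            out.append("s_dcache_wb")
        out.append(ins)
    return out
```

No ``s_dcache_inv`` is modeled because, as stated above, all scalar writes are
write-before-read in the same thread.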
6491
6492For kernarg backing memory:
6493
6494* CP invalidates the L1 cache at the start of each kernel dispatch.
6495* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6496  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6497  cache. This also causes it to be treated as non-volatile and so is not
6498  invalidated by ``*_vol``.
6499* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6500  so the L2 cache will be coherent with the CPU and other agents.
6501
6502Scratch backing memory (which is used for the private address space) is accessed
6503with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6504only accessed by a single thread, and is always write-before-read, there is
6505never a need to invalidate these entries from the L1 cache. Hence all cache
6506invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6507
6508The code sequences used to implement the memory model for GFX90A are defined
6509in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6510
6511  .. table:: AMDHSA Memory Model Code Sequences GFX90A
6512     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6513
6514     ============ ============ ============== ========== ================================
6515     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6516                  Ordering     Sync Scope     Address    GFX90A
6517                                              Space
6518     ============ ============ ============== ========== ================================
6519     **Non-Atomic**
6520     ------------------------------------------------------------------------------------
6521     load         *none*       *none*         - global   - !volatile & !nontemporal
6522                                              - generic
6523                                              - private    1. buffer/global/flat_load
6524                                              - constant
6525                                                         - !volatile & nontemporal
6526
6527                                                           1. buffer/global/flat_load
6528                                                              glc=1 slc=1
6529
6530                                                         - volatile
6531
6532                                                           1. buffer/global/flat_load
6533                                                              glc=1
6534                                                           2. s_waitcnt vmcnt(0)
6535
6536                                                            - Must happen before
6537                                                              any following volatile
6538                                                              global/generic
6539                                                              load/store.
6540                                                            - Ensures that
6541                                                              volatile
6542                                                              operations to
6543                                                              different
6544                                                              addresses will not
6545                                                              be reordered by
6546                                                              hardware.
6547
6548     load         *none*       *none*         - local    1. ds_load
6549     store        *none*       *none*         - global   - !volatile & !nontemporal
6550                                              - generic
6551                                              - private    1. buffer/global/flat_store
6552                                              - constant
6553                                                         - !volatile & nontemporal
6554
6555                                                           1. buffer/global/flat_store
6556                                                              glc=1 slc=1
6557
6558                                                         - volatile
6559
6560                                                           1. buffer/global/flat_store
6561                                                           2. s_waitcnt vmcnt(0)
6562
6563                                                            - Must happen before
6564                                                              any following volatile
6565                                                              global/generic
6566                                                              load/store.
6567                                                            - Ensures that
6568                                                              volatile
6569                                                              operations to
6570                                                              different
6571                                                              addresses will not
6572                                                              be reordered by
6573                                                              hardware.
6574
6575     store        *none*       *none*         - local    1. ds_store
6576     **Unordered Atomic**
6577     ------------------------------------------------------------------------------------
6578     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6579     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6580     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6581     **Monotonic Atomic**
6582     ------------------------------------------------------------------------------------
6583     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6584                               - wavefront    - generic
6585     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6586                                              - generic     glc=1
6587
6588                                                           - If not TgSplit execution
6589                                                             mode, omit glc=1.
6590
6591     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6592                               - wavefront               local address space cannot
6593                               - workgroup               be used.*
6594
6595                                                         1. ds_load
6596     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6597                                              - generic     glc=1
6598     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6599                                              - generic     glc=1
6600     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6601                               - wavefront    - generic
6602                               - workgroup
6603                               - agent
6604     store atomic monotonic    - system       - global   1. buffer/global/flat_store
6605                                              - generic
6606     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6607                               - wavefront               local address space cannot
6608                               - workgroup               be used.*
6609
6610                                                         1. ds_store
6611     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6612                               - wavefront    - generic
6613                               - workgroup
6614                               - agent
6615     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6616                                              - generic
6617     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6618                               - wavefront               local address space cannot
6619                               - workgroup               be used.*
6620
6621                                                         1. ds_atomic
6622     **Acquire Atomic**
6623     ------------------------------------------------------------------------------------
6624     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6625                               - wavefront    - local
6626                                              - generic
6627     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6628
6629                                                           - If not TgSplit execution
6630                                                             mode, omit glc=1.
6631
6632                                                         2. s_waitcnt vmcnt(0)
6633
6634                                                           - If not TgSplit execution
6635                                                             mode, omit.
6636                                                           - Must happen before the
6637                                                             following buffer_wbinvl1_vol.
6638
6639                                                         3. buffer_wbinvl1_vol
6640
6641                                                           - If not TgSplit execution
6642                                                             mode, omit.
6643                                                           - Must happen before
6644                                                             any following
6645                                                             global/generic
6646                                                             load/load
6647                                                             atomic/store/store
6648                                                             atomic/atomicrmw.
6649                                                           - Ensures that
6650                                                             following
6651                                                             loads will not see
6652                                                             stale data.
6653
6654     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6655                                                         local address space cannot
6656                                                         be used.*
6657
6658                                                         1. ds_load
6659                                                         2. s_waitcnt lgkmcnt(0)
6660
6661                                                           - If OpenCL, omit.
6662                                                           - Must happen before
6663                                                             any following
6664                                                             global/generic
6665                                                             load/load
6666                                                             atomic/store/store
6667                                                             atomic/atomicrmw.
6668                                                           - Ensures any
6669                                                             following global
6670                                                             data read is no
6671                                                             older than the local load
6672                                                             atomic value being
6673                                                             acquired.
6674
6675     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6676
6677                                                           - If not TgSplit execution
6678                                                             mode, omit glc=1.
6679
6680                                                         2. s_waitcnt lgkm/vmcnt(0)
6681
6682                                                           - Use lgkmcnt(0) if not
6683                                                             TgSplit execution mode
6684                                                             and vmcnt(0) if TgSplit
6685                                                             execution mode.
6686                                                           - If OpenCL, omit lgkmcnt(0).
6687                                                           - Must happen before
6688                                                             the following
6689                                                             buffer_wbinvl1_vol and any
6690                                                             following global/generic
6691                                                             load/load
6692                                                             atomic/store/store
6693                                                             atomic/atomicrmw.
6694                                                           - Ensures any
6695                                                             following global
6696                                                             data read is no
6697                                                             older than a local load
6698                                                             atomic value being
6699                                                             acquired.
6700
6701                                                         3. buffer_wbinvl1_vol
6702
6703                                                           - If not TgSplit execution
6704                                                             mode, omit.
6705                                                           - Ensures that
6706                                                             following
6707                                                             loads will not see
6708                                                             stale data.
6709
6710     load atomic  acquire      - agent        - global   1. buffer/global_load
6711                                                            glc=1
6712                                                         2. s_waitcnt vmcnt(0)
6713
6714                                                           - Must happen before
6715                                                             the following
6716                                                             buffer_wbinvl1_vol.
6717                                                           - Ensures the load
6718                                                             has completed
6719                                                             before invalidating
6720                                                             the cache.
6721
6722                                                         3. buffer_wbinvl1_vol
6723
6724                                                           - Must happen before
6725                                                             any following
6726                                                             global/generic
6727                                                             load/load
6728                                                             atomic/atomicrmw.
6729                                                           - Ensures that
6730                                                             following
6731                                                             loads will not see
6732                                                             stale global data.
6733
6734     load atomic  acquire      - system       - global   1. buffer/global/flat_load
6735                                                            glc=1
6736                                                         2. s_waitcnt vmcnt(0)
6737
6738                                                           - Must happen before
6739                                                             the following buffer_invl2 and
6740                                                             buffer_wbinvl1_vol.
6741                                                           - Ensures the load
6742                                                             has completed
6743                                                             before invalidating
6744                                                             the caches.
6745
6746                                                         3. buffer_invl2;
6747                                                            buffer_wbinvl1_vol
6748
6749                                                           - Must happen before
6750                                                             any following
6751                                                             global/generic
6752                                                             load/load
6753                                                             atomic/atomicrmw.
6754                                                           - Ensures that
6755                                                             following
6756                                                             loads will not see
6757                                                             stale L1 global data,
6758                                                             nor see stale L2 MTYPE
6759                                                             NC global data.
6760                                                             MTYPE RW and CC memory will
6761                                                             never be stale in L2 due to
6762                                                             the memory probes.
6763
6764     load atomic  acquire      - agent        - generic  1. flat_load glc=1
6765                                                         2. s_waitcnt vmcnt(0) &
6766                                                            lgkmcnt(0)
6767
6768                                                           - If TgSplit execution mode,
6769                                                             omit lgkmcnt(0).
6770                                                           - If OpenCL, omit
6771                                                             lgkmcnt(0).
6772                                                           - Must happen before
6773                                                             the following
6774                                                             buffer_wbinvl1_vol.
6775                                                           - Ensures the flat_load
6776                                                             has completed
6777                                                             before invalidating
6778                                                             the cache.
6779
6780                                                         3. buffer_wbinvl1_vol
6781
6782                                                           - Must happen before
6783                                                             any following
6784                                                             global/generic
6785                                                             load/load
6786                                                             atomic/atomicrmw.
6787                                                           - Ensures that
6788                                                             following loads
6789                                                             will not see stale
6790                                                             global data.
6791
6792     load atomic  acquire      - system       - generic  1. flat_load glc=1
6793                                                         2. s_waitcnt vmcnt(0) &
6794                                                            lgkmcnt(0)
6795
6796                                                           - If TgSplit execution mode,
6797                                                             omit lgkmcnt(0).
6798                                                           - If OpenCL, omit
6799                                                             lgkmcnt(0).
6800                                                           - Must happen before
6801                                                             the following
6802                                                             buffer_invl2 and
6803                                                             buffer_wbinvl1_vol.
6804                                                           - Ensures the flat_load
6805                                                             has completed
6806                                                             before invalidating
6807                                                             the caches.
6808
6809                                                         3. buffer_invl2;
6810                                                            buffer_wbinvl1_vol
6811
6812                                                           - Must happen before
6813                                                             any following
6814                                                             global/generic
6815                                                             load/load
6816                                                             atomic/atomicrmw.
6817                                                           - Ensures that
6818                                                             following
6819                                                             loads will not see
6820                                                             stale L1 global data,
6821                                                             nor see stale L2 MTYPE
6822                                                             NC global data.
6823                                                             MTYPE RW and CC memory will
6824                                                             never be stale in L2 due to
6825                                                             the memory probes.
6826
6827     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6828                               - wavefront    - generic
6829     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6830                               - wavefront               local address space cannot
6831                                                         be used.*
6832
6833                                                         1. ds_atomic
6834     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6835                                                         2. s_waitcnt vmcnt(0)
6836
6837                                                           - If not TgSplit execution
6838                                                             mode, omit.
6839                                                           - Must happen before the
6840                                                             following buffer_wbinvl1_vol.
6841                                                           - Ensures the atomicrmw
6842                                                             has completed
6843                                                             before invalidating
6844                                                             the cache.
6845
6846                                                         3. buffer_wbinvl1_vol
6847
6848                                                           - If not TgSplit execution
6849                                                             mode, omit.
6850                                                           - Must happen before
6851                                                             any following
6852                                                             global/generic
6853                                                             load/load
6854                                                             atomic/atomicrmw.
6855                                                           - Ensures that
6856                                                             following loads
6857                                                             will not see stale
6858                                                             global data.
6859
6860     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6861                                                         local address space cannot
6862                                                         be used.*
6863
6864                                                         1. ds_atomic
6865                                                         2. s_waitcnt lgkmcnt(0)
6866
6867                                                           - If OpenCL, omit.
6868                                                           - Must happen before
6869                                                             any following
6870                                                             global/generic
6871                                                             load/load
6872                                                             atomic/store/store
6873                                                             atomic/atomicrmw.
6874                                                           - Ensures any
6875                                                             following global
6876                                                             data read is no
6877                                                             older than the local
6878                                                             atomicrmw value
6879                                                             being acquired.
6880
6881     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6882                                                         2. s_waitcnt lgkm/vmcnt(0)
6883
6884                                                           - Use lgkmcnt(0) if not
6885                                                             TgSplit execution mode
6886                                                             and vmcnt(0) if TgSplit
6887                                                             execution mode.
6888                                                           - If OpenCL, omit lgkmcnt(0).
6889                                                           - Must happen before
6890                                                             the following
6891                                                             buffer_wbinvl1_vol and
6892                                                             any following
6893                                                             global/generic
6894                                                             load/load
6895                                                             atomic/store/store
6896                                                             atomic/atomicrmw.
6897                                                           - Ensures any
6898                                                             following global
6899                                                             data read is no
6900                                                             older than a local
6901                                                             atomicrmw value
6902                                                             being acquired.
6903
6904                                                         3. buffer_wbinvl1_vol
6905
6906                                                           - If not TgSplit execution
6907                                                             mode, omit.
6908                                                           - Ensures that
6909                                                             following
6910                                                             loads will not see
6911                                                             stale data.
6912
6913     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6914                                                         2. s_waitcnt vmcnt(0)
6915
6916                                                           - Must happen before
6917                                                             the following
6918                                                             buffer_wbinvl1_vol.
6919                                                           - Ensures the
6920                                                             atomicrmw has
6921                                                             completed before
6922                                                             invalidating the
6923                                                             cache.
6924
6925                                                         3. buffer_wbinvl1_vol
6926
6927                                                           - Must happen before
6928                                                             any following
6929                                                             global/generic
6930                                                             load/load
6931                                                             atomic/atomicrmw.
6932                                                           - Ensures that
6933                                                             following loads
6934                                                             will not see stale
6935                                                             global data.
6936
6937     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
6938                                                         2. s_waitcnt vmcnt(0)
6939
6940                                                           - Must happen before
6941                                                             the following buffer_invl2 and
6942                                                             buffer_wbinvl1_vol.
6943                                                           - Ensures the
6944                                                             atomicrmw has
6945                                                             completed before
6946                                                             invalidating the
6947                                                             caches.
6948
6949                                                         3. buffer_invl2;
6950                                                            buffer_wbinvl1_vol
6951
6952                                                           - Must happen before
6953                                                             any following
6954                                                             global/generic
6955                                                             load/load
6956                                                             atomic/atomicrmw.
6957                                                           - Ensures that
6958                                                             following
6959                                                             loads will not see
6960                                                             stale L1 global data,
6961                                                             nor see stale L2 MTYPE
6962                                                             NC global data.
6963                                                             MTYPE RW and CC memory will
6964                                                             never be stale in L2 due to
6965                                                             the memory probes.
6966
6967     atomicrmw    acquire      - agent        - generic  1. flat_atomic
6968                                                         2. s_waitcnt vmcnt(0) &
6969                                                            lgkmcnt(0)
6970
6971                                                           - If TgSplit execution mode,
6972                                                             omit lgkmcnt(0).
6973                                                           - If OpenCL, omit
6974                                                             lgkmcnt(0).
6975                                                           - Must happen before
6976                                                             the following
6977                                                             buffer_wbinvl1_vol.
6978                                                           - Ensures the
6979                                                             atomicrmw has
6980                                                             completed before
6981                                                             invalidating the
6982                                                             cache.
6983
6984                                                         3. buffer_wbinvl1_vol
6985
6986                                                           - Must happen before
6987                                                             any following
6988                                                             global/generic
6989                                                             load/load
6990                                                             atomic/atomicrmw.
6991                                                           - Ensures that
6992                                                             following loads
6993                                                             will not see stale
6994                                                             global data.
6995
6996     atomicrmw    acquire      - system       - generic  1. flat_atomic
6997                                                         2. s_waitcnt vmcnt(0) &
6998                                                            lgkmcnt(0)
6999
7000                                                           - If TgSplit execution mode,
7001                                                             omit lgkmcnt(0).
7002                                                           - If OpenCL, omit
7003                                                             lgkmcnt(0).
7004                                                           - Must happen before
7005                                                             the following
7006                                                             buffer_invl2 and
7007                                                             buffer_wbinvl1_vol.
7008                                                           - Ensures the
7009                                                             atomicrmw has
7010                                                             completed before
7011                                                             invalidating the
7012                                                             caches.
7013
7014                                                         3. buffer_invl2;
7015                                                            buffer_wbinvl1_vol
7016
7017                                                           - Must happen before
7018                                                             any following
7019                                                             global/generic
7020                                                             load/load
7021                                                             atomic/atomicrmw.
7022                                                           - Ensures that
7023                                                             following
7024                                                             loads will not see
7025                                                             stale L1 global data,
7026                                                             nor see stale L2 MTYPE
7027                                                             NC global data.
7028                                                             MTYPE RW and CC memory will
7029                                                             never be stale in L2 due to
7030                                                             the memory probes.
7031
7032     fence        acquire      - singlethread *none*     *none*
7033                               - wavefront
7034     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7035
7036                                                           - Use lgkmcnt(0) if not
7037                                                             TgSplit execution mode
7038                                                             and vmcnt(0) if TgSplit
7039                                                             execution mode.
7040                                                           - If OpenCL and
7041                                                             address space is
7042                                                             not generic, omit
7043                                                             lgkmcnt(0).
7044                                                           - If OpenCL and
7045                                                             address space is
7046                                                             local, omit
7047                                                             vmcnt(0).
7048                                                           - However, since LLVM
7049                                                             currently has no
7050                                                             address space on
7051                                                             the fence, the wait
7052                                                             must conservatively
7053                                                             always be generated.
7054                                                             If the fence had an
7055                                                             address space, it
7056                                                             would be set to the
7057                                                             address space of the
7058                                                             OpenCL fence flag, or
7059                                                             to generic if both
7060                                                             local and global
7061                                                             flags are
7062                                                             specified.
7063                                                           - s_waitcnt vmcnt(0)
7064                                                             must happen after
7065                                                             any preceding
7066                                                             global/generic load
7067                                                             atomic/
7068                                                             atomicrmw
7069                                                             with an equal or
7070                                                             wider sync scope
7071                                                             and memory ordering
7072                                                             stronger than
7073                                                             unordered (this is
7074                                                             termed the
7075                                                             fence-paired-atomic).
7076                                                           - s_waitcnt lgkmcnt(0)
7077                                                             must happen after
7078                                                             any preceding
7079                                                             local/generic load
7080                                                             atomic/atomicrmw
7081                                                             with an equal or
7082                                                             wider sync scope
7083                                                             and memory ordering
7084                                                             stronger than
7085                                                             unordered (this is
7086                                                             termed the
7087                                                             fence-paired-atomic).
7088                                                           - Must happen before
7089                                                             the following
7090                                                             buffer_wbinvl1_vol and
7091                                                             any following
7092                                                             global/generic
7093                                                             load/load
7094                                                             atomic/store/store
7095                                                             atomic/atomicrmw.
7096                                                           - Ensures any
7097                                                             following global
7098                                                             data read is no
7099                                                             older than the
7100                                                             value read by the
7101                                                             fence-paired-atomic.
7102
7103                                                         2. buffer_wbinvl1_vol
7104
7105                                                           - If not TgSplit execution
7106                                                             mode, omit.
7107                                                           - Ensures that
7108                                                             following
7109                                                             loads will not see
7110                                                             stale data.
7111
7112     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7113                                                            vmcnt(0)
7114
7115                                                           - If TgSplit execution mode,
7116                                                             omit lgkmcnt(0).
7117                                                           - If OpenCL and
7118                                                             address space is
7119                                                             not generic, omit
7120                                                             lgkmcnt(0).
7121                                                           - However, since LLVM
7122                                                             currently has no
7123                                                             address space on
7124                                                             the fence, the wait
7125                                                             must conservatively
7126                                                             always be generated
7127                                                             (see comment for
7128                                                             previous fence).
7129                                                           - Could be split into
7130                                                             separate s_waitcnt
7131                                                             vmcnt(0) and
7132                                                             s_waitcnt
7133                                                             lgkmcnt(0) to allow
7134                                                             them to be
7135                                                             independently moved
7136                                                             according to the
7137                                                             following rules.
7138                                                           - s_waitcnt vmcnt(0)
7139                                                             must happen after
7140                                                             any preceding
7141                                                             global/generic load
7142                                                             atomic/atomicrmw
7143                                                             with an equal or
7144                                                             wider sync scope
7145                                                             and memory ordering
7146                                                             stronger than
7147                                                             unordered (this is
7148                                                             termed the
7149                                                             fence-paired-atomic).
7150                                                           - s_waitcnt lgkmcnt(0)
7151                                                             must happen after
7152                                                             any preceding
7153                                                             local/generic load
7154                                                             atomic/atomicrmw
7155                                                             with an equal or
7156                                                             wider sync scope
7157                                                             and memory ordering
7158                                                             stronger than
7159                                                             unordered (this is
7160                                                             termed the
7161                                                             fence-paired-atomic).
7162                                                           - Must happen before
7163                                                             the following
7164                                                             buffer_wbinvl1_vol.
7165                                                           - Ensures that the
7166                                                             fence-paired-atomic
7167                                                             has completed
7168                                                             before invalidating
7169                                                             the
7170                                                             cache. Therefore
7171                                                             any following
7172                                                             locations read must
7173                                                             be no older than
7174                                                             the value read by
7175                                                             the
7176                                                             fence-paired-atomic.
7177
7178                                                         2. buffer_wbinvl1_vol
7179
7180                                                           - Must happen before any
7181                                                             following global/generic
7182                                                             load/load
7183                                                             atomic/store/store
7184                                                             atomic/atomicrmw.
7185                                                           - Ensures that
7186                                                             following loads
7187                                                             will not see stale
7188                                                             global data.
7189
7190     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
7191                                                            vmcnt(0)
7192
7193                                                           - If TgSplit execution mode,
7194                                                             omit lgkmcnt(0).
7195                                                           - If OpenCL and
7196                                                             address space is
7197                                                             not generic, omit
7198                                                             lgkmcnt(0).
7199                                                           - However, since LLVM
7200                                                             currently has no
7201                                                             address space on
7202                                                             the fence, the wait
7203                                                             must conservatively
7204                                                             always be generated
7205                                                             (see comment for
7206                                                             previous fence).
7207                                                           - Could be split into
7208                                                             separate s_waitcnt
7209                                                             vmcnt(0) and
7210                                                             s_waitcnt
7211                                                             lgkmcnt(0) to allow
7212                                                             them to be
7213                                                             independently moved
7214                                                             according to the
7215                                                             following rules.
7216                                                           - s_waitcnt vmcnt(0)
7217                                                             must happen after
7218                                                             any preceding
7219                                                             global/generic load
7220                                                             atomic/atomicrmw
7221                                                             with an equal or
7222                                                             wider sync scope
7223                                                             and memory ordering
7224                                                             stronger than
7225                                                             unordered (this is
7226                                                             termed the
7227                                                             fence-paired-atomic).
7228                                                           - s_waitcnt lgkmcnt(0)
7229                                                             must happen after
7230                                                             any preceding
7231                                                             local/generic load
7232                                                             atomic/atomicrmw
7233                                                             with an equal or
7234                                                             wider sync scope
7235                                                             and memory ordering
7236                                                             stronger than
7237                                                             unordered (this is
7238                                                             termed the
7239                                                             fence-paired-atomic).
7240                                                           - Must happen before
7241                                                             the following buffer_invl2 and
7242                                                             buffer_wbinvl1_vol.
7243                                                           - Ensures that the
7244                                                             fence-paired-atomic
7245                                                             has completed
7246                                                             before invalidating
7247                                                             the
7248                                                             cache. Therefore
7249                                                             any following
7250                                                             locations read must
7251                                                             be no older than
7252                                                             the value read by
7253                                                             the
7254                                                             fence-paired-atomic.
7255
7256                                                         2. buffer_invl2;
7257                                                            buffer_wbinvl1_vol
7258
7259                                                           - Must happen before any
7260                                                             following global/generic
7261                                                             load/load
7262                                                             atomic/store/store
7263                                                             atomic/atomicrmw.
7264                                                           - Ensures that
7265                                                             following
7266                                                             loads will not see
7267                                                             stale L1 global data,
7268                                                             nor see stale L2 MTYPE
7269                                                             NC global data.
7270                                                             MTYPE RW and CC memory will
7271                                                             never be stale in L2 due to
7272                                                             the memory probes.
7273     **Release Atomic**
7274     ------------------------------------------------------------------------------------
7275     store atomic release      - singlethread - global   1. buffer/global/flat_store
7276                               - wavefront    - generic
7277     store atomic release      - singlethread - local    *If TgSplit execution mode,
7278                               - wavefront               local address space cannot
7279                                                         be used.*
7280
7281                                                         1. ds_store
7282     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7283                                              - generic
7284                                                           - Use lgkmcnt(0) if not
7285                                                             TgSplit execution mode
7286                                                             and vmcnt(0) if TgSplit
7287                                                             execution mode.
7288                                                           - If OpenCL, omit lgkmcnt(0).
7289                                                           - s_waitcnt vmcnt(0)
7290                                                             must happen after
7291                                                             any preceding
7292                                                             global/generic load/store/
7293                                                             load atomic/store atomic/
7294                                                             atomicrmw.
7295                                                           - s_waitcnt lgkmcnt(0)
7296                                                             must happen after
7297                                                             any preceding
7298                                                             local/generic
7299                                                             load/store/load
7300                                                             atomic/store
7301                                                             atomic/atomicrmw.
7302                                                           - Must happen before
7303                                                             the following
7304                                                             store.
7305                                                           - Ensures that all
7306                                                             memory operations
7307                                                             have
7308                                                             completed before
7309                                                             performing the
7310                                                             store that is being
7311                                                             released.
7312
7313                                                         2. buffer/global/flat_store
7314     store atomic release      - workgroup    - local    *If TgSplit execution mode,
7315                                                         local address space cannot
7316                                                         be used.*
7317
7318                                                         1. ds_store
7319     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7320                                              - generic     vmcnt(0)
7321
7322                                                           - If TgSplit execution mode,
7323                                                             omit lgkmcnt(0).
7324                                                           - If OpenCL and
7325                                                             address space is
7326                                                             not generic, omit
7327                                                             lgkmcnt(0).
7328                                                           - Could be split into
7329                                                             separate s_waitcnt
7330                                                             vmcnt(0) and
7331                                                             s_waitcnt
7332                                                             lgkmcnt(0) to allow
7333                                                             them to be
7334                                                             independently moved
7335                                                             according to the
7336                                                             following rules.
7337                                                           - s_waitcnt vmcnt(0)
7338                                                             must happen after
7339                                                             any preceding
7340                                                             global/generic
7341                                                             load/store/load
7342                                                             atomic/store
7343                                                             atomic/atomicrmw.
7344                                                           - s_waitcnt lgkmcnt(0)
7345                                                             must happen after
7346                                                             any preceding
7347                                                             local/generic
7348                                                             load/store/load
7349                                                             atomic/store
7350                                                             atomic/atomicrmw.
7351                                                           - Must happen before
7352                                                             the following
7353                                                             store.
7354                                                           - Ensures that all
7355                                                             memory operations
7356                                                             have
7357                                                             completed before
7358                                                             performing the
7359                                                             store that is being
7360                                                             released.
7361
7362                                                         2. buffer/global/flat_store
7363     store atomic release      - system       - global   1. buffer_wbl2
7364                                              - generic
7365                                                           - Must happen before
7366                                                             following s_waitcnt.
7367                                                           - Performs L2 writeback to
7368                                                             ensure previous
7369                                                             global/generic
7370                                                             store/atomicrmw are
7371                                                             visible at system scope.
7372
7373                                                         2. s_waitcnt lgkmcnt(0) &
7374                                                            vmcnt(0)
7375
7376                                                           - If TgSplit execution mode,
7377                                                             omit lgkmcnt(0).
7378                                                           - If OpenCL and
7379                                                             address space is
7380                                                             not generic, omit
7381                                                             lgkmcnt(0).
7382                                                           - Could be split into
7383                                                             separate s_waitcnt
7384                                                             vmcnt(0) and
7385                                                             s_waitcnt
7386                                                             lgkmcnt(0) to allow
7387                                                             them to be
7388                                                             independently moved
7389                                                             according to the
7390                                                             following rules.
7391                                                           - s_waitcnt vmcnt(0)
7392                                                             must happen after any
7393                                                             preceding
7394                                                             global/generic
7395                                                             load/store/load
7396                                                             atomic/store
7397                                                             atomic/atomicrmw.
7398                                                           - s_waitcnt lgkmcnt(0)
7399                                                             must happen after any
7400                                                             preceding
7401                                                             local/generic
7402                                                             load/store/load
7403                                                             atomic/store
7404                                                             atomic/atomicrmw.
7405                                                           - Must happen before
7406                                                             the following
7407                                                             store.
7408                                                           - Ensures that all
7409                                                             memory operations
7410                                                             to memory and the L2
7411                                                             writeback have
7412                                                             completed before
7413                                                             performing the
7414                                                             store that is being
7415                                                             released.
7416
7417                                                         3. buffer/global/flat_store
7418     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7419                               - wavefront    - generic
7420     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7421                               - wavefront               local address space cannot
7422                                                         be used.*
7423
7424                                                         1. ds_atomic
7425     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7426                                              - generic
7427                                                           - Use lgkmcnt(0) if not
7428                                                             TgSplit execution mode
7429                                                             and vmcnt(0) if TgSplit
7430                                                             execution mode.
7431                                                           - If OpenCL, omit
7432                                                             lgkmcnt(0).
7433                                                           - s_waitcnt vmcnt(0)
7434                                                             must happen after
7435                                                             any preceding
7436                                                             global/generic load/store/
7437                                                             load atomic/store atomic/
7438                                                             atomicrmw.
7439                                                           - s_waitcnt lgkmcnt(0)
7440                                                             must happen after
7441                                                             any preceding
7442                                                             local/generic
7443                                                             load/store/load
7444                                                             atomic/store
7445                                                             atomic/atomicrmw.
7446                                                           - Must happen before
7447                                                             the following
7448                                                             atomicrmw.
7449                                                           - Ensures that all
7450                                                             memory operations
7451                                                             have
7452                                                             completed before
7453                                                             performing the
7454                                                             atomicrmw that is
7455                                                             being released.
7456
7457                                                         2. buffer/global/flat_atomic
7458     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7459                                                         local address space cannot
7460                                                         be used.*
7461
7462                                                         1. ds_atomic
7463     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7464                                              - generic     vmcnt(0)
7465
7466                                                           - If TgSplit execution mode,
7467                                                             omit lgkmcnt(0).
7468                                                           - If OpenCL, omit
7469                                                             lgkmcnt(0).
7470                                                           - Could be split into
7471                                                             separate s_waitcnt
7472                                                             vmcnt(0) and
7473                                                             s_waitcnt
7474                                                             lgkmcnt(0) to allow
7475                                                             them to be
7476                                                             independently moved
7477                                                             according to the
7478                                                             following rules.
7479                                                           - s_waitcnt vmcnt(0)
7480                                                             must happen after
7481                                                             any preceding
7482                                                             global/generic
7483                                                             load/store/load
7484                                                             atomic/store
7485                                                             atomic/atomicrmw.
7486                                                           - s_waitcnt lgkmcnt(0)
7487                                                             must happen after
7488                                                             any preceding
7489                                                             local/generic
7490                                                             load/store/load
7491                                                             atomic/store
7492                                                             atomic/atomicrmw.
7493                                                           - Must happen before
7494                                                             the following
7495                                                             atomicrmw.
7496                                                           - Ensures that all
7497                                                             memory operations
7498                                                             to global and local
7499                                                             have completed
7500                                                             before performing
7501                                                             the atomicrmw that
7502                                                             is being released.
7503
7504                                                         2. buffer/global/flat_atomic
7505     atomicrmw    release      - system       - global   1. buffer_wbl2
7506                                              - generic
7507                                                           - Must happen before
7508                                                             following s_waitcnt.
7509                                                           - Performs L2 writeback to
7510                                                             ensure previous
7511                                                             global/generic
7512                                                             store/atomicrmw are
7513                                                             visible at system scope.
7514
7515                                                         2. s_waitcnt lgkmcnt(0) &
7516                                                            vmcnt(0)
7517
7518                                                           - If TgSplit execution mode,
7519                                                             omit lgkmcnt(0).
7520                                                           - If OpenCL, omit
7521                                                             lgkmcnt(0).
7522                                                           - Could be split into
7523                                                             separate s_waitcnt
7524                                                             vmcnt(0) and
7525                                                             s_waitcnt
7526                                                             lgkmcnt(0) to allow
7527                                                             them to be
7528                                                             independently moved
7529                                                             according to the
7530                                                             following rules.
7531                                                           - s_waitcnt vmcnt(0)
7532                                                             must happen after
7533                                                             any preceding
7534                                                             global/generic
7535                                                             load/store/load
7536                                                             atomic/store
7537                                                             atomic/atomicrmw.
7538                                                           - s_waitcnt lgkmcnt(0)
7539                                                             must happen after
7540                                                             any preceding
7541                                                             local/generic
7542                                                             load/store/load
7543                                                             atomic/store
7544                                                             atomic/atomicrmw.
7545                                                           - Must happen before
7546                                                             the following
7547                                                             atomicrmw.
7548                                                           - Ensures that all
7549                                                             memory operations
7550                                                             to memory and the L2
7551                                                             writeback have
7552                                                             completed before
7553                                                             performing the
7554                                                             atomicrmw that is
7555                                                             being released.
7556
7557                                                         3. buffer/global/flat_atomic
7558     fence        release      - singlethread *none*     *none*
7559                               - wavefront
7560     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7561
7562                                                           - Use lgkmcnt(0) if not
7563                                                             TgSplit execution mode
7564                                                             and vmcnt(0) if TgSplit
7565                                                             execution mode.
7566                                                           - If OpenCL and
7567                                                             address space is
7568                                                             not generic, omit
7569                                                             lgkmcnt(0).
7570                                                           - If OpenCL and
7571                                                             address space is
7572                                                             local, omit
7573                                                             vmcnt(0).
7574                                                           - However, since LLVM
7575                                                             currently has no
7576                                                             address space on
7577                                                             the fence, s_waitcnt
7578                                                             must conservatively
7579                                                             always be generated. If
7580                                                             fence had an
7581                                                             address space then
7582                                                             set to address
7583                                                             space of OpenCL
7584                                                             fence flag, or to
7585                                                             generic if both
7586                                                             local and global
7587                                                             flags are
7588                                                             specified.
7589                                                           - s_waitcnt vmcnt(0)
7590                                                             must happen after
7591                                                             any preceding
7592                                                             global/generic
7593                                                             load/store/
7594                                                             load atomic/store atomic/
7595                                                             atomicrmw.
7596                                                           - s_waitcnt lgkmcnt(0)
7597                                                             must happen after
7598                                                             any preceding
7599                                                             local/generic
7600                                                             load/load
7601                                                             atomic/store/store
7602                                                             atomic/atomicrmw.
7603                                                           - Must happen before
7604                                                             any following store
7605                                                             atomic/atomicrmw
7606                                                             with an equal or
7607                                                             wider sync scope
7608                                                             and memory ordering
7609                                                             stronger than
7610                                                             unordered (this is
7611                                                             termed the
7612                                                             fence-paired-atomic).
7613                                                           - Ensures that all
7614                                                             memory operations
7615                                                             have
7616                                                             completed before
7617                                                             performing the
7618                                                             following
7619                                                             fence-paired-atomic.
7620
7621     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7622                                                            vmcnt(0)
7623
7624                                                           - If TgSplit execution mode,
7625                                                             omit lgkmcnt(0).
7626                                                           - If OpenCL and
7627                                                             address space is
7628                                                             not generic, omit
7629                                                             lgkmcnt(0).
7630                                                           - If OpenCL and
7631                                                             address space is
7632                                                             local, omit
7633                                                             vmcnt(0).
7634                                                           - However, since LLVM
7635                                                             currently has no
7636                                                             address space on
7637                                                             the fence, s_waitcnt
7638                                                             must conservatively
7639                                                             always be generated. If
7640                                                             fence had an
7641                                                             address space then
7642                                                             set to address
7643                                                             space of OpenCL
7644                                                             fence flag, or to
7645                                                             generic if both
7646                                                             local and global
7647                                                             flags are
7648                                                             specified.
7649                                                           - Could be split into
7650                                                             separate s_waitcnt
7651                                                             vmcnt(0) and
7652                                                             s_waitcnt
7653                                                             lgkmcnt(0) to allow
7654                                                             them to be
7655                                                             independently moved
7656                                                             according to the
7657                                                             following rules.
7658                                                           - s_waitcnt vmcnt(0)
7659                                                             must happen after
7660                                                             any preceding
7661                                                             global/generic
7662                                                             load/store/load
7663                                                             atomic/store
7664                                                             atomic/atomicrmw.
7665                                                           - s_waitcnt lgkmcnt(0)
7666                                                             must happen after
7667                                                             any preceding
7668                                                             local/generic
7669                                                             load/store/load
7670                                                             atomic/store
7671                                                             atomic/atomicrmw.
7672                                                           - Must happen before
7673                                                             any following store
7674                                                             atomic/atomicrmw
7675                                                             with an equal or
7676                                                             wider sync scope
7677                                                             and memory ordering
7678                                                             stronger than
7679                                                             unordered (this is
7680                                                             termed the
7681                                                             fence-paired-atomic).
7682                                                           - Ensures that all
7683                                                             memory operations
7684                                                             have
7685                                                             completed before
7686                                                             performing the
7687                                                             following
7688                                                             fence-paired-atomic.
7689
7690     fence        release      - system       *none*     1. buffer_wbl2
7691
7692                                                           - If OpenCL and
7693                                                             address space is
7694                                                             local, omit.
7695                                                           - Must happen before
7696                                                             following s_waitcnt.
7697                                                           - Performs L2 writeback to
7698                                                             ensure previous
7699                                                             global/generic
7700                                                             store/atomicrmw are
7701                                                             visible at system scope.
7702
7703                                                         2. s_waitcnt lgkmcnt(0) &
7704                                                            vmcnt(0)
7705
7706                                                           - If TgSplit execution mode,
7707                                                             omit lgkmcnt(0).
7708                                                           - If OpenCL and
7709                                                             address space is
7710                                                             not generic, omit
7711                                                             lgkmcnt(0).
7712                                                           - If OpenCL and
7713                                                             address space is
7714                                                             local, omit
7715                                                             vmcnt(0).
7716                                                           - However, since LLVM
7717                                                             currently has no
7718                                                             address space on
7719                                                             the fence, s_waitcnt
7720                                                             must conservatively
7721                                                             always be generated. If
7722                                                             fence had an
7723                                                             address space then
7724                                                             set to address
7725                                                             space of OpenCL
7726                                                             fence flag, or to
7727                                                             generic if both
7728                                                             local and global
7729                                                             flags are
7730                                                             specified.
7731                                                           - Could be split into
7732                                                             separate s_waitcnt
7733                                                             vmcnt(0) and
7734                                                             s_waitcnt
7735                                                             lgkmcnt(0) to allow
7736                                                             them to be
7737                                                             independently moved
7738                                                             according to the
7739                                                             following rules.
7740                                                           - s_waitcnt vmcnt(0)
7741                                                             must happen after
7742                                                             any preceding
7743                                                             global/generic
7744                                                             load/store/load
7745                                                             atomic/store
7746                                                             atomic/atomicrmw.
7747                                                           - s_waitcnt lgkmcnt(0)
7748                                                             must happen after
7749                                                             any preceding
7750                                                             local/generic
7751                                                             load/store/load
7752                                                             atomic/store
7753                                                             atomic/atomicrmw.
7754                                                           - Must happen before
7755                                                             any following store
7756                                                             atomic/atomicrmw
7757                                                             with an equal or
7758                                                             wider sync scope
7759                                                             and memory ordering
7760                                                             stronger than
7761                                                             unordered (this is
7762                                                             termed the
7763                                                             fence-paired-atomic).
7764                                                           - Ensures that all
7765                                                             memory operations
7766                                                             have
7767                                                             completed before
7768                                                             performing the
7769                                                             following
7770                                                             fence-paired-atomic.
7771
7772     **Acquire-Release Atomic**
7773     ------------------------------------------------------------------------------------
7774     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7775                               - wavefront    - generic
7776     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7777                               - wavefront               local address space cannot
7778                                                         be used.*
7779
7780                                                         1. ds_atomic
7781     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7782
7783                                                           - Use lgkmcnt(0) if not
7784                                                             TgSplit execution mode
7785                                                             and vmcnt(0) if TgSplit
7786                                                             execution mode.
7787                                                           - If OpenCL, omit
7788                                                             lgkmcnt(0).
7795                                                           - s_waitcnt vmcnt(0)
7796                                                             must happen after
7797                                                             any preceding
7798                                                             global/generic load/store/
7799                                                             load atomic/store atomic/
7800                                                             atomicrmw.
7801                                                           - s_waitcnt lgkmcnt(0)
7802                                                             must happen after
7803                                                             any preceding
7804                                                             local/generic
7805                                                             load/store/load
7806                                                             atomic/store
7807                                                             atomic/atomicrmw.
7808                                                           - Must happen before
7809                                                             the following
7810                                                             atomicrmw.
7811                                                           - Ensures that all
7812                                                             memory operations
7813                                                             have
7814                                                             completed before
7815                                                             performing the
7816                                                             atomicrmw that is
7817                                                             being released.
7818
7819                                                         2. buffer/global_atomic
7820                                                         3. s_waitcnt vmcnt(0)
7821
7822                                                           - If not TgSplit execution
7823                                                             mode, omit.
7824                                                           - Must happen before
7825                                                             the following
7826                                                             buffer_wbinvl1_vol.
7827                                                           - Ensures any
7828                                                             following global
7829                                                             data read is no
7830                                                             older than the
7831                                                             atomicrmw value
7832                                                             being acquired.
7833
7834                                                         4. buffer_wbinvl1_vol
7835
7836                                                           - If not TgSplit execution
7837                                                             mode, omit.
7838                                                           - Ensures that
7839                                                             following
7840                                                             loads will not see
7841                                                             stale data.
7842
7843     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
7844                                                         local address space cannot
7845                                                         be used.*
7846
7847                                                         1. ds_atomic
7848                                                         2. s_waitcnt lgkmcnt(0)
7849
7850                                                           - If OpenCL, omit.
7851                                                           - Must happen before
7852                                                             any following
7853                                                             global/generic
7854                                                             load/load
7855                                                             atomic/store/store
7856                                                             atomic/atomicrmw.
7857                                                           - Ensures any
7858                                                             following global
7859                                                             data read is no
7860                                                             older than the local
7861                                                             atomicrmw value being
7862                                                             acquired.
7863
7864     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
7865
7866                                                           - Use lgkmcnt(0) if not
7867                                                             TgSplit execution mode
7868                                                             and vmcnt(0) if TgSplit
7869                                                             execution mode.
7870                                                           - If OpenCL, omit
7871                                                             lgkmcnt(0).
7872                                                           - s_waitcnt vmcnt(0)
7873                                                             must happen after
7874                                                             any preceding
7875                                                             global/generic load/store/
7876                                                             load atomic/store atomic/
7877                                                             atomicrmw.
7878                                                           - s_waitcnt lgkmcnt(0)
7879                                                             must happen after
7880                                                             any preceding
7881                                                             local/generic
7882                                                             load/store/load
7883                                                             atomic/store
7884                                                             atomic/atomicrmw.
7885                                                           - Must happen before
7886                                                             the following
7887                                                             atomicrmw.
7888                                                           - Ensures that all
7889                                                             memory operations
7890                                                             have
7891                                                             completed before
7892                                                             performing the
7893                                                             atomicrmw that is
7894                                                             being released.
7895
7896                                                         2. flat_atomic
7897                                                         3. s_waitcnt lgkmcnt(0) &
7898                                                            vmcnt(0)
7899
7900                                                           - If not TgSplit execution
7901                                                             mode, omit vmcnt(0).
7902                                                           - If OpenCL, omit
7903                                                             lgkmcnt(0).
7904                                                           - Must happen before
7905                                                             the following
7906                                                             buffer_wbinvl1_vol and
7907                                                             any following
7908                                                             global/generic
7909                                                             load/load
7910                                                             atomic/store/store
7911                                                             atomic/atomicrmw.
7912                                                           - Ensures any
7913                                                             following global
7914                                                             data read is no
7915                                                             older than a local
7916                                                             atomicrmw value being
7917                                                             acquired.
7918
7919                                                         4. buffer_wbinvl1_vol
7920
7921                                                           - If not TgSplit execution
7922                                                             mode, omit.
7923                                                           - Ensures that
7924                                                             following
7925                                                             loads will not see
7926                                                             stale data.
7927
7928     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7929                                                            vmcnt(0)
7930
7931                                                           - If TgSplit execution mode,
7932                                                             omit lgkmcnt(0).
7933                                                           - If OpenCL, omit
7934                                                             lgkmcnt(0).
7935                                                           - Could be split into
7936                                                             separate s_waitcnt
7937                                                             vmcnt(0) and
7938                                                             s_waitcnt
7939                                                             lgkmcnt(0) to allow
7940                                                             them to be
7941                                                             independently moved
7942                                                             according to the
7943                                                             following rules.
7944                                                           - s_waitcnt vmcnt(0)
7945                                                             must happen after
7946                                                             any preceding
7947                                                             global/generic
7948                                                             load/store/load
7949                                                             atomic/store
7950                                                             atomic/atomicrmw.
7951                                                           - s_waitcnt lgkmcnt(0)
7952                                                             must happen after
7953                                                             any preceding
7954                                                             local/generic
7955                                                             load/store/load
7956                                                             atomic/store
7957                                                             atomic/atomicrmw.
7958                                                           - Must happen before
7959                                                             the following
7960                                                             atomicrmw.
7961                                                           - Ensures that all
7962                                                             memory operations
7963                                                             to global have
7964                                                             completed before
7965                                                             performing the
7966                                                             atomicrmw that is
7967                                                             being released.
7968
7969                                                         2. buffer/global_atomic
7970                                                         3. s_waitcnt vmcnt(0)
7971
7972                                                           - Must happen before
7973                                                             following
7974                                                             buffer_wbinvl1_vol.
7975                                                           - Ensures the
7976                                                             atomicrmw has
7977                                                             completed before
7978                                                             invalidating the
7979                                                             cache.
7980
7981                                                         4. buffer_wbinvl1_vol
7982
7983                                                           - Must happen before
7984                                                             any following
7985                                                             global/generic
7986                                                             load/load
7987                                                             atomic/atomicrmw.
7988                                                           - Ensures that
7989                                                             following loads
7990                                                             will not see stale
7991                                                             global data.
7992
7993     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
7994
7995                                                           - Must happen before
7996                                                             following s_waitcnt.
7997                                                           - Performs L2 writeback to
7998                                                             ensure previous
7999                                                             global/generic
8000                                                             store/atomicrmw are
8001                                                             visible at system scope.
8002
8003                                                         2. s_waitcnt lgkmcnt(0) &
8004                                                            vmcnt(0)
8005
8006                                                           - If TgSplit execution mode,
8007                                                             omit lgkmcnt(0).
8008                                                           - If OpenCL, omit
8009                                                             lgkmcnt(0).
8010                                                           - Could be split into
8011                                                             separate s_waitcnt
8012                                                             vmcnt(0) and
8013                                                             s_waitcnt
8014                                                             lgkmcnt(0) to allow
8015                                                             them to be
8016                                                             independently moved
8017                                                             according to the
8018                                                             following rules.
8019                                                           - s_waitcnt vmcnt(0)
8020                                                             must happen after
8021                                                             any preceding
8022                                                             global/generic
8023                                                             load/store/load
8024                                                             atomic/store
8025                                                             atomic/atomicrmw.
8026                                                           - s_waitcnt lgkmcnt(0)
8027                                                             must happen after
8028                                                             any preceding
8029                                                             local/generic
8030                                                             load/store/load
8031                                                             atomic/store
8032                                                             atomic/atomicrmw.
8033                                                           - Must happen before
8034                                                             the following
8035                                                             atomicrmw.
8036                                                           - Ensures that all
8037                                                             memory operations
8038                                                             to global and L2 writeback
8039                                                             have completed before
8040                                                             performing the
8041                                                             atomicrmw that is
8042                                                             being released.
8043
8044                                                         3. buffer/global_atomic
8045                                                         4. s_waitcnt vmcnt(0)
8046
8047                                                           - Must happen before
8048                                                             following buffer_invl2 and
8049                                                             buffer_wbinvl1_vol.
8050                                                           - Ensures the
8051                                                             atomicrmw has
8052                                                             completed before
8053                                                             invalidating the
8054                                                             caches.
8055
8056                                                         5. buffer_invl2;
8057                                                            buffer_wbinvl1_vol
8058
8059                                                           - Must happen before
8060                                                             any following
8061                                                             global/generic
8062                                                             load/load
8063                                                             atomic/atomicrmw.
8064                                                           - Ensures that
8065                                                             following
8066                                                             loads will not see
8067                                                             stale L1 global data,
8068                                                             nor see stale L2 MTYPE
8069                                                             NC global data.
8070                                                             MTYPE RW and CC memory will
8071                                                             never be stale in L2 due to
8072                                                             the memory probes.
8073
8074     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
8075                                                            vmcnt(0)
8076
8077                                                           - If TgSplit execution mode,
8078                                                             omit lgkmcnt(0).
8079                                                           - If OpenCL, omit
8080                                                             lgkmcnt(0).
8081                                                           - Could be split into
8082                                                             separate s_waitcnt
8083                                                             vmcnt(0) and
8084                                                             s_waitcnt
8085                                                             lgkmcnt(0) to allow
8086                                                             them to be
8087                                                             independently moved
8088                                                             according to the
8089                                                             following rules.
8090                                                           - s_waitcnt vmcnt(0)
8091                                                             must happen after
8092                                                             any preceding
8093                                                             global/generic
8094                                                             load/store/load
8095                                                             atomic/store
8096                                                             atomic/atomicrmw.
8097                                                           - s_waitcnt lgkmcnt(0)
8098                                                             must happen after
8099                                                             any preceding
8100                                                             local/generic
8101                                                             load/store/load
8102                                                             atomic/store
8103                                                             atomic/atomicrmw.
8104                                                           - Must happen before
8105                                                             the following
8106                                                             atomicrmw.
8107                                                           - Ensures that all
8108                                                             memory operations
8109                                                             to global have
8110                                                             completed before
8111                                                             performing the
8112                                                             atomicrmw that is
8113                                                             being released.
8114
8115                                                         2. flat_atomic
8116                                                         3. s_waitcnt vmcnt(0) &
8117                                                            lgkmcnt(0)
8118
8119                                                           - If TgSplit execution mode,
8120                                                             omit lgkmcnt(0).
8121                                                           - If OpenCL, omit
8122                                                             lgkmcnt(0).
8123                                                           - Must happen before
8124                                                             following
8125                                                             buffer_wbinvl1_vol.
8126                                                           - Ensures the
8127                                                             atomicrmw has
8128                                                             completed before
8129                                                             invalidating the
8130                                                             cache.
8131
8132                                                         4. buffer_wbinvl1_vol
8133
8134                                                           - Must happen before
8135                                                             any following
8136                                                             global/generic
8137                                                             load/load
8138                                                             atomic/atomicrmw.
8139                                                           - Ensures that
8140                                                             following loads
8141                                                             will not see stale
8142                                                             global data.
8143
8144     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
8145
8146                                                           - Must happen before
8147                                                             following s_waitcnt.
8148                                                           - Performs L2 writeback to
8149                                                             ensure previous
8150                                                             global/generic
8151                                                             store/atomicrmw are
8152                                                             visible at system scope.
8153
8154                                                         2. s_waitcnt lgkmcnt(0) &
8155                                                            vmcnt(0)
8156
8157                                                           - If TgSplit execution mode,
8158                                                             omit lgkmcnt(0).
8159                                                           - If OpenCL, omit
8160                                                             lgkmcnt(0).
8161                                                           - Could be split into
8162                                                             separate s_waitcnt
8163                                                             vmcnt(0) and
8164                                                             s_waitcnt
8165                                                             lgkmcnt(0) to allow
8166                                                             them to be
8167                                                             independently moved
8168                                                             according to the
8169                                                             following rules.
8170                                                           - s_waitcnt vmcnt(0)
8171                                                             must happen after
8172                                                             any preceding
8173                                                             global/generic
8174                                                             load/store/load
8175                                                             atomic/store
8176                                                             atomic/atomicrmw.
8177                                                           - s_waitcnt lgkmcnt(0)
8178                                                             must happen after
8179                                                             any preceding
8180                                                             local/generic
8181                                                             load/store/load
8182                                                             atomic/store
8183                                                             atomic/atomicrmw.
8184                                                           - Must happen before
8185                                                             the following
8186                                                             atomicrmw.
8187                                                           - Ensures that all
8188                                                             memory operations
8189                                                             to global and L2 writeback
8190                                                             have completed before
8191                                                             performing the
8192                                                             atomicrmw that is
8193                                                             being released.
8194
8195                                                         3. flat_atomic
8196                                                         4. s_waitcnt vmcnt(0) &
8197                                                            lgkmcnt(0)
8198
8199                                                           - If TgSplit execution mode,
8200                                                             omit lgkmcnt(0).
8201                                                           - If OpenCL, omit
8202                                                             lgkmcnt(0).
8203                                                           - Must happen before
8204                                                             following buffer_invl2 and
8205                                                             buffer_wbinvl1_vol.
8206                                                           - Ensures the
8207                                                             atomicrmw has
8208                                                             completed before
8209                                                             invalidating the
8210                                                             caches.
8211
8212                                                         5. buffer_invl2;
8213                                                            buffer_wbinvl1_vol
8214
8215                                                           - Must happen before
8216                                                             any following
8217                                                             global/generic
8218                                                             load/load
8219                                                             atomic/atomicrmw.
8220                                                           - Ensures that
8221                                                             following
8222                                                             loads will not see
8223                                                             stale L1 global data,
8224                                                             nor see stale L2 MTYPE
8225                                                             NC global data.
8226                                                             MTYPE RW and CC memory will
8227                                                             never be stale in L2 due to
8228                                                             the memory probes.
8229
8230     fence        acq_rel      - singlethread *none*     *none*
8231                               - wavefront
8232     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
8233
8234                                                           - Use lgkmcnt(0) if not
8235                                                             TgSplit execution mode
8236                                                             and vmcnt(0) if TgSplit
8237                                                             execution mode.
8238                                                           - If OpenCL and
8239                                                             address space is
8240                                                             not generic, omit
8241                                                             lgkmcnt(0).
8242                                                           - If OpenCL and
8243                                                             address space is
8244                                                             local, omit
8245                                                             vmcnt(0).
8246                                                           - However, since LLVM
8247                                                             currently has no
8248                                                             address space on
8249                                                             the fence, the
8250                                                             s_waitcnt must
8251                                                             conservatively
8252                                                             always be generated
8253                                                             (see comment for
8254                                                             previous fence).
8255                                                           - s_waitcnt vmcnt(0)
8256                                                             must happen after
8257                                                             any preceding
8258                                                             global/generic
8259                                                             load/store/
8260                                                             load atomic/store atomic/
8261                                                             atomicrmw.
8262                                                           - s_waitcnt lgkmcnt(0)
8263                                                             must happen after
8264                                                             any preceding
8265                                                             local/generic
8266                                                             load/load
8267                                                             atomic/store/store
8268                                                             atomic/atomicrmw.
8269                                                           - Must happen before
8270                                                             any following
8271                                                             global/generic
8272                                                             load/load
8273                                                             atomic/store/store
8274                                                             atomic/atomicrmw.
8275                                                           - Ensures that all
8276                                                             memory operations
8277                                                             have
8278                                                             completed before
8279                                                             performing any
8280                                                             following global
8281                                                             memory operations.
8282                                                           - Ensures that the
8283                                                             preceding
8284                                                             local/generic load
8285                                                             atomic/atomicrmw
8286                                                             with an equal or
8287                                                             wider sync scope
8288                                                             and memory ordering
8289                                                             stronger than
8290                                                             unordered (this is
8291                                                             termed the
8292                                                             acquire-fence-paired-atomic)
8293                                                             has completed
8294                                                             before following
8295                                                             global memory
8296                                                             operations. This
8297                                                             satisfies the
8298                                                             requirements of
8299                                                             acquire.
8300                                                           - Ensures that all
8301                                                             previous memory
8302                                                             operations have
8303                                                             completed before a
8304                                                             following
8305                                                             local/generic store
8306                                                             atomic/atomicrmw
8307                                                             with an equal or
8308                                                             wider sync scope
8309                                                             and memory ordering
8310                                                             stronger than
8311                                                             unordered (this is
8312                                                             termed the
8313                                                             release-fence-paired-atomic).
8314                                                             This satisfies the
8315                                                             requirements of
8316                                                             release.
8317                                                           - Must happen before
8318                                                             the following
8319                                                             buffer_wbinvl1_vol.
8320                                                           - Ensures that the
8321                                                             acquire-fence-paired-atomic
8322                                                             has completed
8323                                                             before invalidating
8324                                                             the
8325                                                             cache. Therefore
8326                                                             any following
8327                                                             locations read must
8328                                                             be no older than
8329                                                             the value read by
8330                                                             the
8331                                                             acquire-fence-paired-atomic.
8332
8333                                                         2. buffer_wbinvl1_vol
8334
8335                                                           - If not TgSplit execution
8336                                                             mode, omit.
8337                                                           - Ensures that
8338                                                             following
8339                                                             loads will not see
8340                                                             stale data.
8341
8342     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8343                                                            vmcnt(0)
8344
8345                                                           - If TgSplit execution mode,
8346                                                             omit lgkmcnt(0).
8347                                                           - If OpenCL and
8348                                                             address space is
8349                                                             not generic, omit
8350                                                             lgkmcnt(0).
8351                                                           - However, since LLVM
8352                                                             currently has no
8353                                                             address space on
8354                                                             the fence, it must
8355                                                             conservatively
8356                                                             always be generated
8357                                                             (see comment for
8358                                                             previous fence).
8359                                                           - Could be split into
8360                                                             separate s_waitcnt
8361                                                             vmcnt(0) and
8362                                                             s_waitcnt
8363                                                             lgkmcnt(0) to allow
8364                                                             them to be
8365                                                             independently moved
8366                                                             according to the
8367                                                             following rules.
8368                                                           - s_waitcnt vmcnt(0)
8369                                                             must happen after
8370                                                             any preceding
8371                                                             global/generic
8372                                                             load/store/load
8373                                                             atomic/store
8374                                                             atomic/atomicrmw.
8375                                                           - s_waitcnt lgkmcnt(0)
8376                                                             must happen after
8377                                                             any preceding
8378                                                             local/generic
8379                                                             load/store/load
8380                                                             atomic/store
8381                                                             atomic/atomicrmw.
8382                                                           - Must happen before
8383                                                             the following
8384                                                             buffer_wbinvl1_vol.
8385                                                           - Ensures that the
8386                                                             preceding
8387                                                             global/local/generic
8388                                                             load
8389                                                             atomic/atomicrmw
8390                                                             with an equal or
8391                                                             wider sync scope
8392                                                             and memory ordering
8393                                                             stronger than
8394                                                             unordered (this is
8395                                                             termed the
8396                                                             acquire-fence-paired-atomic)
8397                                                             has completed
8398                                                             before invalidating
8399                                                             the cache. This
8400                                                             satisfies the
8401                                                             requirements of
8402                                                             acquire.
8403                                                           - Ensures that all
8404                                                             previous memory
8405                                                             operations have
8406                                                             completed before a
8407                                                             following
8408                                                             global/local/generic
8409                                                             store
8410                                                             atomic/atomicrmw
8411                                                             with an equal or
8412                                                             wider sync scope
8413                                                             and memory ordering
8414                                                             stronger than
8415                                                             unordered (this is
8416                                                             termed the
8417                                                             release-fence-paired-atomic).
8418                                                             This satisfies the
8419                                                             requirements of
8420                                                             release.
8421
8422                                                         2. buffer_wbinvl1_vol
8423
8424                                                           - Must happen before
8425                                                             any following
8426                                                             global/generic
8427                                                             load/load
8428                                                             atomic/store/store
8429                                                             atomic/atomicrmw.
8430                                                           - Ensures that
8431                                                             following loads
8432                                                             will not see stale
8433                                                             global data. This
8434                                                             satisfies the
8435                                                             requirements of
8436                                                             acquire.
8437
8438     fence        acq_rel      - system       *none*     1. buffer_wbl2
8439
8440                                                           - If OpenCL and
8441                                                             address space is
8442                                                             local, omit.
8443                                                           - Must happen before
8444                                                             following s_waitcnt.
8445                                                           - Performs L2 writeback to
8446                                                             ensure previous
8447                                                             global/generic
8448                                                             store/atomicrmw are
8449                                                             visible at system scope.
8450
8451                                                         2. s_waitcnt lgkmcnt(0) &
8452                                                            vmcnt(0)
8453
8454                                                           - If TgSplit execution mode,
8455                                                             omit lgkmcnt(0).
8456                                                           - If OpenCL and
8457                                                             address space is
8458                                                             not generic, omit
8459                                                             lgkmcnt(0).
8460                                                           - However, since LLVM
8461                                                             currently has no
8462                                                             address space on
8463                                                             the fence, it must
8464                                                             conservatively
8465                                                             always be generated
8466                                                             (see comment for
8467                                                             previous fence).
8468                                                           - Could be split into
8469                                                             separate s_waitcnt
8470                                                             vmcnt(0) and
8471                                                             s_waitcnt
8472                                                             lgkmcnt(0) to allow
8473                                                             them to be
8474                                                             independently moved
8475                                                             according to the
8476                                                             following rules.
8477                                                           - s_waitcnt vmcnt(0)
8478                                                             must happen after
8479                                                             any preceding
8480                                                             global/generic
8481                                                             load/store/load
8482                                                             atomic/store
8483                                                             atomic/atomicrmw.
8484                                                           - s_waitcnt lgkmcnt(0)
8485                                                             must happen after
8486                                                             any preceding
8487                                                             local/generic
8488                                                             load/store/load
8489                                                             atomic/store
8490                                                             atomic/atomicrmw.
8491                                                           - Must happen before
8492                                                             the following buffer_invl2 and
8493                                                             buffer_wbinvl1_vol.
8494                                                           - Ensures that the
8495                                                             preceding
8496                                                             global/local/generic
8497                                                             load
8498                                                             atomic/atomicrmw
8499                                                             with an equal or
8500                                                             wider sync scope
8501                                                             and memory ordering
8502                                                             stronger than
8503                                                             unordered (this is
8504                                                             termed the
8505                                                             acquire-fence-paired-atomic)
8506                                                             has completed
8507                                                             before invalidating
8508                                                             the cache. This
8509                                                             satisfies the
8510                                                             requirements of
8511                                                             acquire.
8512                                                           - Ensures that all
8513                                                             previous memory
8514                                                             operations have
8515                                                             completed before a
8516                                                             following
8517                                                             global/local/generic
8518                                                             store
8519                                                             atomic/atomicrmw
8520                                                             with an equal or
8521                                                             wider sync scope
8522                                                             and memory ordering
8523                                                             stronger than
8524                                                             unordered (this is
8525                                                             termed the
8526                                                             release-fence-paired-atomic).
8527                                                             This satisfies the
8528                                                             requirements of
8529                                                             release.
8530
8531                                                         3. buffer_invl2;
8532                                                            buffer_wbinvl1_vol
8533
8534                                                           - Must happen before
8535                                                             any following
8536                                                             global/generic
8537                                                             load/load
8538                                                             atomic/store/store
8539                                                             atomic/atomicrmw.
8540                                                           - Ensures that
8541                                                             following
8542                                                             loads will not see
8543                                                             stale L1 global data,
8544                                                             nor see stale L2 MTYPE
8545                                                             NC global data.
8546                                                             MTYPE RW and CC memory will
8547                                                             never be stale in L2 due to
8548                                                             the memory probes.
8549
8550     **Sequentially Consistent Atomic**
8551     ------------------------------------------------------------------------------------
8552     load atomic  seq_cst      - singlethread - global   *Same as corresponding
8553                               - wavefront    - local    load atomic acquire,
8554                                              - generic  except must generate
8555                                                         all instructions even
8556                                                         for OpenCL.*
8557     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8558                                              - generic
8559                                                           - Use lgkmcnt(0) if not
8560                                                             TgSplit execution mode
8561                                                             and vmcnt(0) if TgSplit
8562                                                             execution mode.
8563                                                           - s_waitcnt lgkmcnt(0) must
8564                                                             happen after
8565                                                             preceding
8566                                                             local/generic load
8567                                                             atomic/store
8568                                                             atomic/atomicrmw
8569                                                             with memory
8570                                                             ordering of seq_cst
8571                                                             and with equal or
8572                                                             wider sync scope.
8573                                                             (Note that seq_cst
8574                                                             fences have their
8575                                                             own s_waitcnt
8576                                                             lgkmcnt(0) and so do
8577                                                             not need to be
8578                                                             considered.)
8579                                                           - s_waitcnt vmcnt(0)
8580                                                             must happen after
8581                                                             preceding
8582                                                             global/generic load
8583                                                             atomic/store
8584                                                             atomic/atomicrmw
8585                                                             with memory
8586                                                             ordering of seq_cst
8587                                                             and with equal or
8588                                                             wider sync scope.
8589                                                             (Note that seq_cst
8590                                                             fences have their
8591                                                             own s_waitcnt
8592                                                             vmcnt(0) and so do
8593                                                             not need to be
8594                                                             considered.)
8595                                                           - Ensures any
8596                                                             preceding
8597                                                             sequentially
8598                                                             consistent global/local
8599                                                             memory instructions
8600                                                             have completed
8601                                                             before executing
8602                                                             this sequentially
8603                                                             consistent
8604                                                             instruction. This
8605                                                             prevents reordering
8606                                                             a seq_cst store
8607                                                             followed by a
8608                                                             seq_cst load. (Note
8609                                                             that seq_cst is
8610                                                             stronger than
8611                                                             acquire/release as
8612                                                             the reordering of
8613                                                             load acquire
8614                                                             followed by a store
8615                                                             release is
8616                                                             prevented by the
8617                                                             s_waitcnt of
8618                                                             the release, but
8619                                                             there is nothing
8620                                                             preventing a store
8621                                                             release followed by
8622                                                             load acquire from
8623                                                             completing out of
8624                                                             order. The s_waitcnt
8625                                                             could be placed after
8626                                                             the seq_cst store or
8627                                                             before the seq_cst
8628                                                             load. We choose the
8629                                                             load to make the
8630                                                             s_waitcnt as late as
8631                                                             possible so that the
8632                                                             store may have already
8633                                                             completed.)
8634
8635                                                         2. *Following
8636                                                            instructions same as
8637                                                            corresponding load
8638                                                            atomic acquire,
8639                                                            except must generate
8640                                                            all instructions even
8641                                                            for OpenCL.*
8642     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8643                                                         local address space cannot
8644                                                         be used.*
8645
8646                                                         *Same as corresponding
8647                                                         load atomic acquire,
8648                                                         except must generate
8649                                                         all instructions even
8650                                                         for OpenCL.*
8651
8652     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8653                               - system       - generic     vmcnt(0)
8654
8655                                                           - If TgSplit execution mode,
8656                                                             omit lgkmcnt(0).
8657                                                           - Could be split into
8658                                                             separate s_waitcnt
8659                                                             vmcnt(0)
8660                                                             and s_waitcnt
8661                                                             lgkmcnt(0) to allow
8662                                                             them to be
8663                                                             independently moved
8664                                                             according to the
8665                                                             following rules.
8666                                                           - s_waitcnt lgkmcnt(0)
8667                                                             must happen after
8668                                                             preceding
8669                                                             global/generic load
8670                                                             atomic/store
8671                                                             atomic/atomicrmw
8672                                                             with memory
8673                                                             ordering of seq_cst
8674                                                             and with equal or
8675                                                             wider sync scope.
8676                                                             (Note that seq_cst
8677                                                             fences have their
8678                                                             own s_waitcnt
8679                                                             lgkmcnt(0) and so do
8680                                                             not need to be
8681                                                             considered.)
8682                                                           - s_waitcnt vmcnt(0)
8683                                                             must happen after
8684                                                             preceding
8685                                                             global/generic load
8686                                                             atomic/store
8687                                                             atomic/atomicrmw
8688                                                             with memory
8689                                                             ordering of seq_cst
8690                                                             and with equal or
8691                                                             wider sync scope.
8692                                                             (Note that seq_cst
8693                                                             fences have their
8694                                                             own s_waitcnt
8695                                                             vmcnt(0) and so do
8696                                                             not need to be
8697                                                             considered.)
8698                                                           - Ensures any
8699                                                             preceding
8700                                                             sequentially
8701                                                             consistent global
8702                                                             memory instructions
8703                                                             have completed
8704                                                             before executing
8705                                                             this sequentially
8706                                                             consistent
8707                                                             instruction. This
8708                                                             prevents reordering
8709                                                             a seq_cst store
8710                                                             followed by a
8711                                                             seq_cst load. (Note
8712                                                             that seq_cst is
8713                                                             stronger than
8714                                                             acquire/release as
8715                                                             the reordering of
8716                                                             load acquire
8717                                                             followed by a store
8718                                                             release is
8719                                                             prevented by the
8720                                                             s_waitcnt of
8721                                                             the release, but
8722                                                             there is nothing
8723                                                             preventing a store
8724                                                             release followed by
8725                                                             load acquire from
8726                                                             completing out of
8727                                                             order. The s_waitcnt
8728                                                             could be placed after the
8729                                                             seq_cst store or before
8730                                                             the seq_cst load. We
8731                                                             choose the load to
8732                                                             make the s_waitcnt be
8733                                                             as late as possible
8734                                                             so that the store
8735                                                             may have already
8736                                                             completed.)
8737
8738                                                         2. *Following
8739                                                            instructions same as
8740                                                            corresponding load
8741                                                            atomic acquire,
8742                                                            except must generate
8743                                                            all instructions even
8744                                                            for OpenCL.*
8745     store atomic seq_cst      - singlethread - global   *Same as corresponding
8746                               - wavefront    - local    store atomic release,
8747                               - workgroup    - generic  except must generate
8748                               - agent                   all instructions even
8749                               - system                  for OpenCL.*
8750     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8751                               - wavefront    - local    atomicrmw acq_rel,
8752                               - workgroup    - generic  except must generate
8753                               - agent                   all instructions even
8754                               - system                  for OpenCL.*
8755     fence        seq_cst      - singlethread *none*     *Same as corresponding
8756                               - wavefront               fence acq_rel,
8757                               - workgroup               except must generate
8758                               - agent                   all instructions even
8759                               - system                  for OpenCL.*
8760     ============ ============ ============== ========== ================================
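
The reason a seq_cst load must wait for preceding seq_cst stores (the last rule
in the ``load atomic seq_cst`` entry above) can be illustrated with a toy model.
The following Python sketch is not AMDGPU code and does not model the hardware.
It enumerates the interleavings of the classic store-buffering litmus test,
assuming memory is a single dictionary and that the only relaxation is letting a
load complete before an earlier store of the same thread. All names (``T0``,
``T1``, ``outcomes``) are hypothetical.

```python
from itertools import permutations

# Two threads: each stores 1 to one location, then loads the other location.
# Each operation is (thread, kind, location).
T0 = [("T0", "store", "x"), ("T0", "load", "y")]
T1 = [("T1", "store", "y"), ("T1", "load", "x")]


def outcomes(allow_store_load_reorder):
    """Return the set of (r0, r1) results over all permitted interleavings.

    r0 is the value T0 loads from y; r1 is the value T1 loads from x.
    """
    results = set()
    for order in permutations(T0 + T1):
        # Program order (store before load) is enforced unless the model is
        # told that a load may complete before an earlier store.
        permitted = True
        for thread in (T0, T1):
            store_pos = order.index(thread[0])
            load_pos = order.index(thread[1])
            if load_pos < store_pos and not allow_store_load_reorder:
                permitted = False
        if not permitted:
            continue
        memory = {"x": 0, "y": 0}
        loaded = {}
        for thread_name, kind, location in order:
            if kind == "store":
                memory[location] = 1
            else:
                loaded[thread_name] = memory[location]
        results.add((loaded["T0"], loaded["T1"]))
    return results


if __name__ == "__main__":
    print(sorted(outcomes(allow_store_load_reorder=False)))
    print(sorted(outcomes(allow_store_load_reorder=True)))
```

In this model, the outcome where both loads read 0 appears only when a load is
allowed to pass an earlier store. That is the reordering the ``s_waitcnt``
placed before the seq_cst load is meant to rule out.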
8761
8762.. _amdgpu-amdhsa-memory-model-gfx940:
8763
8764Memory Model GFX940
8765+++++++++++++++++++
8766
8767For GFX940:
8768
8769* Each agent has multiple shader arrays (SA).
8770* Each SA has multiple compute units (CU).
8771* Each CU has multiple SIMDs that execute wavefronts.
8772* The wavefronts for a single work-group are executed in the same CU but may be
8773  executed by different SIMDs. The exception is tgsplit execution mode, in
8774  which the wavefronts may be executed by different SIMDs in different CUs.
8775* Each CU has a single LDS memory shared by the wavefronts of the work-groups
8776  executing on it. The exception is tgsplit execution mode, in which no LDS is
8777  allocated because wavefronts of the same work-group can be in different CUs.
8778* All LDS operations of a CU are performed as wavefront wide operations in a
8779  global order and involve no caching. Completion is reported to a wavefront in
8780  execution order.
8781* The LDS memory has multiple request queues shared by the SIMDs of a
8782  CU. Therefore, the LDS operations performed by different wavefronts of a
8783  work-group can be reordered relative to each other, which can result in
8784  reordering the visibility of vector memory operations with respect to LDS
8785  operations of other wavefronts in the same work-group. A ``s_waitcnt
8786  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8787  vector memory operations between wavefronts of a work-group, but not between
8788  operations performed by the same wavefront.
8789* The vector memory operations are performed as wavefront wide operations and
8790  completion is reported to a wavefront in execution order. The exception is
8791  that ``flat_load/store/atomic`` instructions can report out of vector memory
8792  order if they access LDS memory, and out of LDS operation order if they access
8793  global memory.
8794* The vector memory operations access a single vector L1 cache shared by all
8795  SIMDs of a CU. Therefore:
8796
8797  * No special action is required for coherence between the lanes of a single
8798    wavefront.
8799
8800  * No special action is required for coherence between wavefronts in the same
8801    work-group since they execute on the same CU. The exception is tgsplit
8802    execution mode, in which wavefronts of the same work-group can be in
8803    different CUs, so a ``buffer_inv sc0`` is required to invalidate the L1
8804    cache.
8805
8806  * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
8807    between wavefronts executing in different work-groups as they may be
8808    executing on different CUs.
8809
8810  * Atomic read-modify-write instructions implicitly bypass the L1 cache.
8811    Therefore, they do not use the sc0 bit for coherence and instead use it to
8812    indicate if the instruction returns the original value being updated. They
8813    do use sc1 to indicate system or agent scope coherence.
8814
8815* The scalar memory operations access a scalar L1 cache shared by all wavefronts
8816  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
8817  scalar operations are used in a restricted way so do not impact the memory
8818  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
8819* The vector and scalar memory operations use an L2 cache.
8820
8821  * The gfx940 can be configured as a number of smaller agents with each having
8822    a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
8823    larger agents with groups of CUs on each agent each sharing separate L2
8824    caches.
8825  * The L2 cache has independent channels to service disjoint ranges of virtual
8826    addresses.
8827  * Each CU has a separate request queue per channel for its associated L2.
8828    Therefore, the vector and scalar memory operations performed by wavefronts
8829    executing with different L1 caches and the same L2 cache can be reordered
8830    relative to each other.
8831  * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
8832    vector memory operations of different CUs. It ensures a previous vector
8833    memory operation has completed before executing a subsequent vector memory
8834    or LDS operation and so can be used to meet the requirements of acquire and
8835    release.
8836  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
8837    (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
8838    the PTE C-bit set for memory not local to the L2.
8839
8840    * Any local memory cache lines will be automatically invalidated by writes
8841      from CUs associated with other L2 caches, or writes from the CPU, due to
8842      the cache probe caused by the PTE C-bit.
8843    * XGMI accesses from the CPU to local memory may be cached on the CPU.
8844      Subsequent accesses from the GPU will automatically invalidate or write
8845      back the CPU cache due to the L2 probe filter.
8846    * To ensure coherence of local memory writes of CUs with different L1 caches
8847      in the same agent a ``buffer_wbl2 sc1`` is required. It does nothing if the
8848      agent is configured to have a single L2, or will write back dirty L2 cache
8849      lines if configured to have multiple L2 caches.
8850    * To ensure coherence of local memory writes of CUs in different agents a
8851      ``buffer_wbl2 sc0 sc1`` is required. It will write back dirty L2 cache lines.
8852    * To ensure coherence of local memory reads of CUs with different L1 caches
8853      in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
8854      agent is configured to have a single L2, or will invalidate non-local L2
8855      cache lines if configured to have multiple L2 caches.
8856    * To ensure coherence of local memory reads of CUs in different agents a
8857      ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
8858      lines if configured to have multiple L2 caches.
8859
8860  * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
8861    UC (uncached) which bypasses the L2.
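
The vector L1 coherence rules in the list above can be summarized as a small
decision function. This is a sketch in Python for illustration only: the
function name and its boolean parameters are hypothetical, and it covers only
the L1 cases described above, not L2 coherence or atomic read-modify-write
instructions.

```python
def l1_coherence_action(same_wavefront, same_work_group, tgsplit):
    """Return the L1 invalidate needed before reading another party's writes.

    Returns None when no special action is required, otherwise the
    instruction text named in the list above.
    """
    if same_wavefront:
        # Lanes of a single wavefront need no special action.
        return None
    if same_work_group and not tgsplit:
        # Wavefronts of one work-group share a CU, and so share the L1.
        return None
    # Different work-groups, or the same work-group in tgsplit mode, may be
    # on different CUs with different L1 caches.
    return "buffer_inv sc0"


if __name__ == "__main__":
    print(l1_coherence_action(False, True, tgsplit=False))
    print(l1_coherence_action(False, True, tgsplit=True))
    print(l1_coherence_action(False, False, tgsplit=False))
```

The function deliberately ignores scope bits beyond ``sc0``. Agent and system
scope use ``sc1``, as described in the L2 items of the same list.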
8862
8863Scalar memory operations are only used to access memory that is proven to not
8864change during the execution of the kernel dispatch. This includes constant
8865address space and global address space for program scope ``const`` variables.
8866Therefore, the kernel machine code does not have to maintain the scalar cache to
8867ensure it is coherent with the vector caches. The scalar and vector caches are
8868invalidated between kernel dispatches by CP since constant address space data
8869may change between kernel dispatch executions. See
8870:ref:`amdgpu-amdhsa-memory-spaces`.
8871
8872The one exception is if scalar writes are used to spill SGPR registers. In this
8873case the AMDGPU backend ensures the memory location used to spill is never
8874accessed by vector memory operations at the same time. If scalar writes are used
8875then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8876return since the locations may be used for vector memory instructions by a
8877future wavefront that uses the same scratch area, or a function call that
8878creates a frame at the same address, respectively. There is no need for a
8879``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8880
8881For kernarg backing memory:
8882
8883* CP invalidates the L1 cache at the start of each kernel dispatch.
8884* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
8885  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
8886  cache. This also causes it to be treated as non-volatile and so is not
8887  invalidated by ``*_vol``.
8888* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8889  so the L2 cache will be coherent with the CPU and other agents.
8890
8891Scratch backing memory (which is used for the private address space) is accessed
8892with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
8893only accessed by a single thread, and is always write-before-read, there is
8894never a need to invalidate these entries from the L1 cache. Hence all cache
8895invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
8896
8897The code sequences used to implement the memory model for GFX940 are defined
8898in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
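
As a reading aid for that table, the following Python sketch transcribes how the
sync scope selects the ``sc0`` and ``sc1`` bits in two groups of rows: monotonic
atomic loads and stores to the global or generic address space, and the
``buffer_inv`` that ends a ``load atomic acquire`` on those address spaces. The
names ``SCOPE_BITS``, ``monotonic_access`` and ``acquire_invalidate`` are
hypothetical. The sketch does not cover ``atomicrmw``, the local address space,
or the ``s_waitcnt`` instructions that the table also requires.

```python
# Scope bits transcribed from the Monotonic Atomic load/store rows.
SCOPE_BITS = {
    "singlethread": [],
    "wavefront": [],
    "workgroup": ["sc0=1"],
    "agent": ["sc1=1"],
    "system": ["sc0=1", "sc1=1"],
}


def monotonic_access(kind, scope):
    """Return the instruction text for a monotonic atomic load or store."""
    if kind not in ("load", "store"):
        # atomicrmw uses sc0 to request the returned value, so it is excluded.
        raise ValueError("only 'load' and 'store' are modelled")
    return " ".join(["buffer/global/flat_" + kind] + SCOPE_BITS[scope])


def acquire_invalidate(scope, tgsplit=False):
    """Return the buffer_inv that ends a load atomic acquire, or None."""
    if scope in ("singlethread", "wavefront"):
        return None
    if scope == "workgroup" and not tgsplit:
        # Omitted unless in TgSplit execution mode.
        return None
    return " ".join(["buffer_inv"] + SCOPE_BITS[scope])


if __name__ == "__main__":
    for scope in SCOPE_BITS:
        print(scope, "|", monotonic_access("load", scope), "|",
              acquire_invalidate(scope, tgsplit=True))
```

Keeping the bits in one dictionary mirrors how the table repeats the same
scope-to-bit pattern across loads, stores and invalidates.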
8899
8900  .. table:: AMDHSA Memory Model Code Sequences GFX940
8901     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
8902
8903     ============ ============ ============== ========== ================================
8904     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
8905                  Ordering     Sync Scope     Address    GFX940
8906                                              Space
8907     ============ ============ ============== ========== ================================
8908     **Non-Atomic**
8909     ------------------------------------------------------------------------------------
8910     load         *none*       *none*         - global   - !volatile & !nontemporal
8911                                              - generic
8912                                              - private    1. buffer/global/flat_load
8913                                              - constant
8914                                                         - !volatile & nontemporal
8915
8916                                                           1. buffer/global/flat_load
8917                                                              nt=1
8918
8919                                                         - volatile
8920
8921                                                           1. buffer/global/flat_load
8922                                                              sc0=1 sc1=1
8923                                                           2. s_waitcnt vmcnt(0)
8924
8925                                                            - Must happen before
8926                                                              any following volatile
8927                                                              global/generic
8928                                                              load/store.
8929                                                            - Ensures that
8930                                                              volatile
8931                                                              operations to
8932                                                              different
8933                                                              addresses will not
8934                                                              be reordered by
8935                                                              hardware.
8936
8937     load         *none*       *none*         - local    1. ds_load
8938     store        *none*       *none*         - global   - !volatile & !nontemporal
8939                                              - generic
8940                                              - private    1. buffer/global/flat_store
8941                                              - constant
8942                                                         - !volatile & nontemporal
8943
8944                                                           1. buffer/global/flat_store
8945                                                              nt=1
8946
8947                                                         - volatile
8948
8949                                                           1. buffer/global/flat_store
8950                                                              sc0=1 sc1=1
8951                                                           2. s_waitcnt vmcnt(0)
8952
8953                                                            - Must happen before
8954                                                              any following volatile
8955                                                              global/generic
8956                                                              load/store.
8957                                                            - Ensures that
8958                                                              volatile
8959                                                              operations to
8960                                                              different
8961                                                              addresses will not
8962                                                              be reordered by
8963                                                              hardware.
8964
8965     store        *none*       *none*         - local    1. ds_store
8966     **Unordered Atomic**
8967     ------------------------------------------------------------------------------------
8968     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
8969     store atomic unordered    *any*          *any*      *Same as non-atomic*.
8970     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
8971     **Monotonic Atomic**
8972     ------------------------------------------------------------------------------------
8973     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
8974                               - wavefront    - generic
8975     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
8976                                              - generic     sc0=1
8977     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
8978                               - wavefront               local address space cannot
8979                               - workgroup               be used.*
8980
8981                                                         1. ds_load
8982     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
8983                                              - generic     sc1=1
8984     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
8985                                              - generic     sc0=1 sc1=1
8986     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
8987                               - wavefront    - generic
8988     store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
8989                                              - generic     sc0=1
8990     store atomic monotonic    - agent        - global   1. buffer/global/flat_store
8991                                              - generic     sc1=1
8992     store atomic monotonic    - system       - global   1. buffer/global/flat_store
8993                                              - generic     sc0=1 sc1=1
8994     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
8995                               - wavefront               local address space cannot
8996                               - workgroup               be used.*
8997
8998                                                         1. ds_store
8999     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
9000                               - wavefront    - generic
9001                               - workgroup
9002                               - agent
9003     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
9004                                              - generic     sc1=1
9005     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
9006                               - wavefront               local address space cannot
9007                               - workgroup               be used.*
9008
9009                                                         1. ds_atomic
9010     **Acquire Atomic**
9011     ------------------------------------------------------------------------------------
9012     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
9013                               - wavefront    - local
9014                                              - generic
9015     load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
9016                                                         2. s_waitcnt vmcnt(0)
9017
9018                                                           - If not TgSplit execution
9019                                                             mode, omit.
9020                                                           - Must happen before the
9021                                                             following buffer_inv.
9022
9023                                                         3. buffer_inv sc0=1
9024
9025                                                           - If not TgSplit execution
9026                                                             mode, omit.
9027                                                           - Must happen before
9028                                                             any following
9029                                                             global/generic
9030                                                             load/load
9031                                                             atomic/store/store
9032                                                             atomic/atomicrmw.
9033                                                           - Ensures that
9034                                                             following
9035                                                             loads will not see
9036                                                             stale data.
9037
9038     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
9039                                                         local address space cannot
9040                                                         be used.*
9041
9042                                                         1. ds_load
9043                                                         2. s_waitcnt lgkmcnt(0)
9044
9045                                                           - If OpenCL, omit.
9046                                                           - Must happen before
9047                                                             any following
9048                                                             global/generic
9049                                                             load/load
9050                                                             atomic/store/store
9051                                                             atomic/atomicrmw.
9052                                                           - Ensures any
9053                                                             following global
9054                                                             data read is no
9055                                                             older than the local load
9056                                                             atomic value being
9057                                                             acquired.
9058
9059     load atomic  acquire      - workgroup    - generic  1. flat_load sc0=1
9060                                                         2. s_waitcnt lgkm/vmcnt(0)
9061
9062                                                           - Use lgkmcnt(0) if not
9063                                                             TgSplit execution mode
9064                                                             and vmcnt(0) if TgSplit
9065                                                             execution mode.
9066                                                           - If OpenCL, omit lgkmcnt(0).
9067                                                           - Must happen before
9068                                                             the following
9069                                                             buffer_inv and any
9070                                                             following global/generic
9071                                                             load/load
9072                                                             atomic/store/store
9073                                                             atomic/atomicrmw.
9074                                                           - Ensures any
9075                                                             following global
9076                                                             data read is no
9077                                                             older than a local load
9078                                                             atomic value being
9079                                                             acquired.
9080
9081                                                         3. buffer_inv sc0=1
9082
9083                                                           - If not TgSplit execution
9084                                                             mode, omit.
9085                                                           - Ensures that
9086                                                             following
9087                                                             loads will not see
9088                                                             stale data.
9089
9090     load atomic  acquire      - agent        - global   1. buffer/global_load
9091                                                            sc1=1
9092                                                         2. s_waitcnt vmcnt(0)
9093
9094                                                           - Must happen before
9095                                                             following
9096                                                             buffer_inv.
9097                                                           - Ensures the load
9098                                                             has completed
9099                                                             before invalidating
9100                                                             the cache.
9101
9102                                                         3. buffer_inv sc1=1
9103
9104                                                           - Must happen before
9105                                                             any following
9106                                                             global/generic
9107                                                             load/load
9108                                                             atomic/atomicrmw.
9109                                                           - Ensures that
9110                                                             following
9111                                                             loads will not see
9112                                                             stale global data.
9113
9114     load atomic  acquire      - system       - global   1. buffer/global/flat_load
9115                                                            sc0=1 sc1=1
9116                                                         2. s_waitcnt vmcnt(0)
9117
9118                                                           - Must happen before
9119                                                             following
9120                                                             buffer_inv.
9121                                                           - Ensures the load
9122                                                             has completed
9123                                                             before invalidating
9124                                                             the cache.
9125
9126                                                         3. buffer_inv sc0=1 sc1=1
9127
9128                                                           - Must happen before
9129                                                             any following
9130                                                             global/generic
9131                                                             load/load
9132                                                             atomic/atomicrmw.
9133                                                           - Ensures that
9134                                                             following
9135                                                             loads will not see
9136                                                             stale MTYPE NC global data.
9137                                                             MTYPE RW and CC memory will
9138                                                             never be stale due to the
9139                                                             memory probes.
9140
9141     load atomic  acquire      - agent        - generic  1. flat_load sc1=1
9142                                                         2. s_waitcnt vmcnt(0) &
9143                                                            lgkmcnt(0)
9144
9145                                                           - If TgSplit execution mode,
9146                                                             omit lgkmcnt(0).
9147                                                           - If OpenCL, omit
9148                                                             lgkmcnt(0).
9149                                                           - Must happen before
9150                                                             following
9151                                                             buffer_inv.
9152                                                           - Ensures the flat_load
9153                                                             has completed
9154                                                             before invalidating
9155                                                             the cache.
9156
9157                                                         3. buffer_inv sc1=1
9158
9159                                                           - Must happen before
9160                                                             any following
9161                                                             global/generic
9162                                                             load/load
9163                                                             atomic/atomicrmw.
9164                                                           - Ensures that
9165                                                             following loads
9166                                                             will not see stale
9167                                                             global data.
9168
9169     load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
9170                                                         2. s_waitcnt vmcnt(0) &
9171                                                            lgkmcnt(0)
9172
9173                                                           - If TgSplit execution mode,
9174                                                             omit lgkmcnt(0).
9175                                                           - If OpenCL, omit
9176                                                             lgkmcnt(0).
9177                                                           - Must happen before
9178                                                             the following
9179                                                             buffer_inv.
9180                                                           - Ensures the flat_load
9181                                                             has completed
9182                                                             before invalidating
9183                                                             the caches.
9184
9185                                                         3. buffer_inv sc0=1 sc1=1
9186
9187                                                           - Must happen before
9188                                                             any following
9189                                                             global/generic
9190                                                             load/load
9191                                                             atomic/atomicrmw.
9192                                                           - Ensures that
9193                                                             following
9194                                                             loads will not see
9195                                                             stale MTYPE NC global data.
9196                                                             MTYPE RW and CC memory will
9197                                                             never be stale due to the
9198                                                             memory probes.
9199
9200     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
9201                               - wavefront    - generic
9202     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
9203                               - wavefront               local address space cannot
9204                                                         be used.*
9205
9206                                                         1. ds_atomic
9207     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
9208                                                         2. s_waitcnt vmcnt(0)
9209
9210                                                           - If not TgSplit execution
9211                                                             mode, omit.
9212                                                           - Must happen before the
9213                                                             following buffer_inv.
9214                                                           - Ensures the atomicrmw
9215                                                             has completed
9216                                                             before invalidating
9217                                                             the cache.
9218
9219                                                         3. buffer_inv sc0=1
9220
9221                                                           - If not TgSplit execution
9222                                                             mode, omit.
9223                                                           - Must happen before
9224                                                             any following
9225                                                             global/generic
9226                                                             load/load
9227                                                             atomic/atomicrmw.
9228                                                           - Ensures that
9229                                                             following loads
9230                                                             will not see stale
9231                                                             global data.
9232
9233     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
9234                                                         local address space cannot
9235                                                         be used.*
9236
9237                                                         1. ds_atomic
9238                                                         2. s_waitcnt lgkmcnt(0)
9239
9240                                                           - If OpenCL, omit.
9241                                                           - Must happen before
9242                                                             any following
9243                                                             global/generic
9244                                                             load/load
9245                                                             atomic/store/store
9246                                                             atomic/atomicrmw.
9247                                                           - Ensures any
9248                                                             following global
9249                                                             data read is no
9250                                                             older than the local
9251                                                             atomicrmw value
9252                                                             being acquired.
9253
9254     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
9255                                                         2. s_waitcnt lgkm/vmcnt(0)
9256
9257                                                           - Use lgkmcnt(0) if not
9258                                                             TgSplit execution mode
9259                                                             and vmcnt(0) if TgSplit
9260                                                             execution mode.
9261                                                           - If OpenCL, omit lgkmcnt(0).
9262                                                           - Must happen before
9263                                                             the following
9264                                                             buffer_inv and
9265                                                             any following
9266                                                             global/generic
9267                                                             load/load
9268                                                             atomic/store/store
9269                                                             atomic/atomicrmw.
9270                                                           - Ensures any
9271                                                             following global
9272                                                             data read is no
9273                                                             older than a local
9274                                                             atomicrmw value
9275                                                             being acquired.
9276
9277                                                         3. buffer_inv sc0=1
9278
9279                                                           - If not TgSplit execution
9280                                                             mode, omit.
9281                                                           - Ensures that
9282                                                             following
9283                                                             loads will not see
9284                                                             stale data.
9285
9286     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
9287                                                         2. s_waitcnt vmcnt(0)
9288
9289                                                           - Must happen before
9290                                                             the following
9291                                                             buffer_inv.
9292                                                           - Ensures the
9293                                                             atomicrmw has
9294                                                             completed before
9295                                                             invalidating the
9296                                                             cache.
9297
9298                                                         3. buffer_inv sc1=1
9299
9300                                                           - Must happen before
9301                                                             any following
9302                                                             global/generic
9303                                                             load/load
9304                                                             atomic/atomicrmw.
9305                                                           - Ensures that
9306                                                             following loads
9307                                                             will not see stale
9308                                                             global data.
9309
9310     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
9311                                                            sc1=1
9312                                                         2. s_waitcnt vmcnt(0)
9313
9314                                                           - Must happen before
9315                                                             the following
9316                                                             buffer_inv.
9317                                                           - Ensures the
9318                                                             atomicrmw has
9319                                                             completed before
9320                                                             invalidating the
9321                                                             caches.
9322
9323                                                         3. buffer_inv sc0=1 sc1=1
9324
9325                                                           - Must happen before
9326                                                             any following
9327                                                             global/generic
9328                                                             load/load
9329                                                             atomic/atomicrmw.
9330                                                           - Ensures that
9331                                                             following
9332                                                             loads will not see
9333                                                             stale MTYPE NC global data.
9334                                                             MTYPE RW and CC memory will
9335                                                             never be stale due to the
9336                                                             memory probes.
9337
9338     atomicrmw    acquire      - agent        - generic  1. flat_atomic
9339                                                         2. s_waitcnt vmcnt(0) &
9340                                                            lgkmcnt(0)
9341
9342                                                           - If TgSplit execution mode,
9343                                                             omit lgkmcnt(0).
9344                                                           - If OpenCL, omit
9345                                                             lgkmcnt(0).
9346                                                           - Must happen before
9347                                                             the following
9348                                                             buffer_inv.
9349                                                           - Ensures the
9350                                                             atomicrmw has
9351                                                             completed before
9352                                                             invalidating the
9353                                                             cache.
9354
9355                                                         3. buffer_inv sc1=1
9356
9357                                                           - Must happen before
9358                                                             any following
9359                                                             global/generic
9360                                                             load/load
9361                                                             atomic/atomicrmw.
9362                                                           - Ensures that
9363                                                             following loads
9364                                                             will not see stale
9365                                                             global data.
9366
9367     atomicrmw    acquire      - system       - generic  1. flat_atomic sc1=1
9368                                                         2. s_waitcnt vmcnt(0) &
9369                                                            lgkmcnt(0)
9370
9371                                                           - If TgSplit execution mode,
9372                                                             omit lgkmcnt(0).
9373                                                           - If OpenCL, omit
9374                                                             lgkmcnt(0).
9375                                                           - Must happen before
9376                                                             the following
9377                                                             buffer_inv.
9378                                                           - Ensures the
9379                                                             atomicrmw has
9380                                                             completed before
9381                                                             invalidating the
9382                                                             caches.
9383
9384                                                         3. buffer_inv sc0=1 sc1=1
9385
9386                                                           - Must happen before
9387                                                             any following
9388                                                             global/generic
9389                                                             load/load
9390                                                             atomic/atomicrmw.
9391                                                           - Ensures that
9392                                                             following
9393                                                             loads will not see
9394                                                             stale MTYPE NC global data.
9395                                                             MTYPE RW and CC memory will
9396                                                             never be stale due to the
9397                                                             memory probes.
9398
9399     fence        acquire      - singlethread *none*     *none*
9400                               - wavefront
9401     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9402
9403                                                           - Use lgkmcnt(0) if not
9404                                                             TgSplit execution mode
9405                                                             and vmcnt(0) if TgSplit
9406                                                             execution mode.
9407                                                           - If OpenCL and
9408                                                             address space is
9409                                                             not generic, omit
9410                                                             lgkmcnt(0).
9411                                                           - If OpenCL and
9412                                                             address space is
9413                                                             local, omit
9414                                                             vmcnt(0).
9415                                                           - However, since LLVM
9416                                                             currently has no
9417                                                             address space on
9418                                                             the fence, the waits
9419                                                             must conservatively
9420                                                             always be generated.
9421                                                             If the fence had an
9422                                                             address space, it
9423                                                             would be set to the
9424                                                             address space of the
9425                                                             OpenCL fence flag, or
9426                                                             to generic if both
9427                                                             local and global
9428                                                             flags are
9429                                                             specified.
9430                                                           - s_waitcnt vmcnt(0)
9431                                                             must happen after
9432                                                             any preceding
9433                                                             global/generic load
9434                                                             atomic/
9435                                                             atomicrmw
9436                                                             with an equal or
9437                                                             wider sync scope
9438                                                             and memory ordering
9439                                                             stronger than
9440                                                             unordered (this is
9441                                                             termed the
9442                                                             fence-paired-atomic).
9443                                                           - s_waitcnt lgkmcnt(0)
9444                                                             must happen after
9445                                                             any preceding
9446                                                             local/generic load
9447                                                             atomic/atomicrmw
9448                                                             with an equal or
9449                                                             wider sync scope
9450                                                             and memory ordering
9451                                                             stronger than
9452                                                             unordered (this is
9453                                                             termed the
9454                                                             fence-paired-atomic).
9455                                                           - Must happen before
9456                                                             the following
9457                                                             buffer_inv and
9458                                                             any following
9459                                                             global/generic
9460                                                             load/load
9461                                                             atomic/store/store
9462                                                             atomic/atomicrmw.
9463                                                           - Ensures any
9464                                                             following global
9465                                                             data read is no
9466                                                             older than the
9467                                                             value read by the
9468                                                             fence-paired-atomic.
9469
9470                                                         2. buffer_inv sc0=1
9471
9472                                                           - If not TgSplit execution
9473                                                             mode, omit.
9474                                                           - Ensures that
9475                                                             following
9476                                                             loads will not see
9477                                                             stale data.
9478
9479     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9480                                                            vmcnt(0)
9481
9482                                                           - If TgSplit execution mode,
9483                                                             omit lgkmcnt(0).
9484                                                           - If OpenCL and
9485                                                             address space is
9486                                                             not generic, omit
9487                                                             lgkmcnt(0).
9488                                                           - However, since LLVM
9489                                                             currently has no
9490                                                             address space on
9491                                                             the fence, it must
9492                                                             conservatively
9493                                                             always be generated
9494                                                             (see comment for
9495                                                             previous fence).
9496                                                           - Could be split into
9497                                                             separate s_waitcnt
9498                                                             vmcnt(0) and
9499                                                             s_waitcnt
9500                                                             lgkmcnt(0) to allow
9501                                                             them to be
9502                                                             independently moved
9503                                                             according to the
9504                                                             following rules.
9505                                                           - s_waitcnt vmcnt(0)
9506                                                             must happen after
9507                                                             any preceding
9508                                                             global/generic load
9509                                                             atomic/atomicrmw
9510                                                             with an equal or
9511                                                             wider sync scope
9512                                                             and memory ordering
9513                                                             stronger than
9514                                                             unordered (this is
9515                                                             termed the
9516                                                             fence-paired-atomic).
9517                                                           - s_waitcnt lgkmcnt(0)
9518                                                             must happen after
9519                                                             any preceding
9520                                                             local/generic load
9521                                                             atomic/atomicrmw
9522                                                             with an equal or
9523                                                             wider sync scope
9524                                                             and memory ordering
9525                                                             stronger than
9526                                                             unordered (this is
9527                                                             termed the
9528                                                             fence-paired-atomic).
9529                                                           - Must happen before
9530                                                             the following
9531                                                             buffer_inv.
9532                                                           - Ensures that the
9533                                                             fence-paired-atomic
9534                                                             has completed
9535                                                             before invalidating
9536                                                             the
9537                                                             cache. Therefore
9538                                                             any following
9539                                                             locations read must
9540                                                             be no older than
9541                                                             the value read by
9542                                                             the
9543                                                             fence-paired-atomic.
9544
9545                                                         2. buffer_inv sc1=1
9546
9547                                                           - Must happen before any
9548                                                             following global/generic
9549                                                             load/load
9550                                                             atomic/store/store
9551                                                             atomic/atomicrmw.
9552                                                           - Ensures that
9553                                                             following loads
9554                                                             will not see stale
9555                                                             global data.
9556
9557     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
9558                                                            vmcnt(0)
9559
9560                                                           - If TgSplit execution mode,
9561                                                             omit lgkmcnt(0).
9562                                                           - If OpenCL and
9563                                                             address space is
9564                                                             not generic, omit
9565                                                             lgkmcnt(0).
9566                                                           - However, since LLVM
9567                                                             currently has no
9568                                                             address space on
9569                                                             the fence, it must
9570                                                             conservatively
9571                                                             always be generated
9572                                                             (see comment for
9573                                                             previous fence).
9574                                                           - Could be split into
9575                                                             separate s_waitcnt
9576                                                             vmcnt(0) and
9577                                                             s_waitcnt
9578                                                             lgkmcnt(0) to allow
9579                                                             them to be
9580                                                             independently moved
9581                                                             according to the
9582                                                             following rules.
9583                                                           - s_waitcnt vmcnt(0)
9584                                                             must happen after
9585                                                             any preceding
9586                                                             global/generic load
9587                                                             atomic/atomicrmw
9588                                                             with an equal or
9589                                                             wider sync scope
9590                                                             and memory ordering
9591                                                             stronger than
9592                                                             unordered (this is
9593                                                             termed the
9594                                                             fence-paired-atomic).
9595                                                           - s_waitcnt lgkmcnt(0)
9596                                                             must happen after
9597                                                             any preceding
9598                                                             local/generic load
9599                                                             atomic/atomicrmw
9600                                                             with an equal or
9601                                                             wider sync scope
9602                                                             and memory ordering
9603                                                             stronger than
9604                                                             unordered (this is
9605                                                             termed the
9606                                                             fence-paired-atomic).
9607                                                           - Must happen before
9608                                                             the following
9609                                                             buffer_inv.
9610                                                           - Ensures that the
9611                                                             fence-paired-atomic
9612                                                             has completed
9613                                                             before invalidating
9614                                                             the
9615                                                             cache. Therefore
9616                                                             any following
9617                                                             locations read must
9618                                                             be no older than
9619                                                             the value read by
9620                                                             the
9621                                                             fence-paired-atomic.
9622
9623                                                         2. buffer_inv sc0=1 sc1=1
9624
9625                                                           - Must happen before any
9626                                                             following global/generic
9627                                                             load/load
9628                                                             atomic/store/store
9629                                                             atomic/atomicrmw.
9630                                                           - Ensures that
9631                                                             following loads
9632                                                             will not see stale
9633                                                             global data.
9634
9635     **Release Atomic**
9636     ------------------------------------------------------------------------------------
9637     store atomic release      - singlethread - global   1. buffer/global/flat_store
9638                               - wavefront    - generic
9639     store atomic release      - singlethread - local    *If TgSplit execution mode,
9640                               - wavefront               local address space cannot
9641                                                         be used.*
9642
9643                                                         1. ds_store
9644     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9645                                              - generic
9646                                                           - Use lgkmcnt(0) if not
9647                                                             TgSplit execution mode
9648                                                             and vmcnt(0) if TgSplit
9649                                                             execution mode.
9650                                                           - If OpenCL, omit lgkmcnt(0).
9651                                                           - s_waitcnt vmcnt(0)
9652                                                             must happen after
9653                                                             any preceding
9654                                                             global/generic load/store/
9655                                                             load atomic/store atomic/
9656                                                             atomicrmw.
9657                                                           - s_waitcnt lgkmcnt(0)
9658                                                             must happen after
9659                                                             any preceding
9660                                                             local/generic
9661                                                             load/store/load
9662                                                             atomic/store
9663                                                             atomic/atomicrmw.
9664                                                           - Must happen before
9665                                                             the following
9666                                                             store.
9667                                                           - Ensures that all
9668                                                             memory operations
9669                                                             have
9670                                                             completed before
9671                                                             performing the
9672                                                             store that is being
9673                                                             released.
9674
9675                                                         2. buffer/global/flat_store sc0=1
9676     store atomic release      - workgroup    - local    *If TgSplit execution mode,
9677                                                         local address space cannot
9678                                                         be used.*
9679
9680                                                         1. ds_store
9681     store atomic release      - agent        - global   1. buffer_wbl2 sc1=1
9682                                              - generic
9683                                                           - Must happen before
9684                                                             following s_waitcnt.
9685                                                           - Performs L2 writeback to
9686                                                             ensure previous
9687                                                             global/generic
9688                                                             store/atomicrmw are
9689                                                             visible at agent scope.
9690
9691                                                         2. s_waitcnt lgkmcnt(0) &
9692                                                            vmcnt(0)
9693
9694                                                           - If TgSplit execution mode,
9695                                                             omit lgkmcnt(0).
9696                                                           - If OpenCL and
9697                                                             address space is
9698                                                             not generic, omit
9699                                                             lgkmcnt(0).
9700                                                           - Could be split into
9701                                                             separate s_waitcnt
9702                                                             vmcnt(0) and
9703                                                             s_waitcnt
9704                                                             lgkmcnt(0) to allow
9705                                                             them to be
9706                                                             independently moved
9707                                                             according to the
9708                                                             following rules.
9709                                                           - s_waitcnt vmcnt(0)
9710                                                             must happen after
9711                                                             any preceding
9712                                                             global/generic
9713                                                             load/store/load
9714                                                             atomic/store
9715                                                             atomic/atomicrmw.
9716                                                           - s_waitcnt lgkmcnt(0)
9717                                                             must happen after
9718                                                             any preceding
9719                                                             local/generic
9720                                                             load/store/load
9721                                                             atomic/store
9722                                                             atomic/atomicrmw.
9723                                                           - Must happen before
9724                                                             the following
9725                                                             store.
9726                                                           - Ensures that all
9727                                                             memory operations
9728                                                             have
9729                                                             completed before
9730                                                             performing the
9731                                                             store that is being
9732                                                             released.
9733
9734                                                         3. buffer/global/flat_store sc1=1
9735     store atomic release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9736                                              - generic
9737                                                           - Must happen before
9738                                                             following s_waitcnt.
9739                                                           - Performs L2 writeback to
9740                                                             ensure previous
9741                                                             global/generic
9742                                                             store/atomicrmw are
9743                                                             visible at system scope.
9744
9745                                                         2. s_waitcnt lgkmcnt(0) &
9746                                                            vmcnt(0)
9747
9748                                                           - If TgSplit execution mode,
9749                                                             omit lgkmcnt(0).
9750                                                           - If OpenCL and
9751                                                             address space is
9752                                                             not generic, omit
9753                                                             lgkmcnt(0).
9754                                                           - Could be split into
9755                                                             separate s_waitcnt
9756                                                             vmcnt(0) and
9757                                                             s_waitcnt
9758                                                             lgkmcnt(0) to allow
9759                                                             them to be
9760                                                             independently moved
9761                                                             according to the
9762                                                             following rules.
9763                                                           - s_waitcnt vmcnt(0)
9764                                                             must happen after any
9765                                                             preceding
9766                                                             global/generic
9767                                                             load/store/load
9768                                                             atomic/store
9769                                                             atomic/atomicrmw.
9770                                                           - s_waitcnt lgkmcnt(0)
9771                                                             must happen after any
9772                                                             preceding
9773                                                             local/generic
9774                                                             load/store/load
9775                                                             atomic/store
9776                                                             atomic/atomicrmw.
9777                                                           - Must happen before
9778                                                             the following
9779                                                             store.
9780                                                           - Ensures that all
9781                                                             memory operations
9782                                                             and the L2
9783                                                             writeback have
9784                                                             completed before
9785                                                             performing the
9786                                                             store that is being
9787                                                             released.
9788
9789                                                         3. buffer/global/flat_store
9790                                                            sc0=1 sc1=1
9791     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
9792                               - wavefront    - generic
9793     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
9794                               - wavefront               local address space cannot
9795                                                         be used.*
9796
9797                                                         1. ds_atomic
9798     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9799                                              - generic
9800                                                           - Use lgkmcnt(0) if not
9801                                                             TgSplit execution mode
9802                                                             and vmcnt(0) if TgSplit
9803                                                             execution mode.
9804                                                           - If OpenCL, omit
9805                                                             lgkmcnt(0).
9806                                                           - s_waitcnt vmcnt(0)
9807                                                             must happen after
9808                                                             any preceding
9809                                                             global/generic load/store/
9810                                                             load atomic/store atomic/
9811                                                             atomicrmw.
9812                                                           - s_waitcnt lgkmcnt(0)
9813                                                             must happen after
9814                                                             any preceding
9815                                                             local/generic
9816                                                             load/store/load
9817                                                             atomic/store
9818                                                             atomic/atomicrmw.
9819                                                           - Must happen before
9820                                                             the following
9821                                                             atomicrmw.
9822                                                           - Ensures that all
9823                                                             memory operations
9824                                                             have
9825                                                             completed before
9826                                                             performing the
9827                                                             atomicrmw that is
9828                                                             being released.
9829
9830                                                         2. buffer/global/flat_atomic sc0=1
9831     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
9832                                                         local address space cannot
9833                                                         be used.*
9834
9835                                                         1. ds_atomic
9836     atomicrmw    release      - agent        - global   1. buffer_wbl2 sc1=1
9837                                              - generic
9838                                                           - Must happen before
9839                                                             following s_waitcnt.
9840                                                           - Performs L2 writeback to
9841                                                             ensure previous
9842                                                             global/generic
9843                                                             store/atomicrmw are
9844                                                             visible at agent scope.
9845
9846                                                         2. s_waitcnt lgkmcnt(0) &
9847                                                            vmcnt(0)
9848
9849                                                           - If TgSplit execution mode,
9850                                                             omit lgkmcnt(0).
9851                                                           - If OpenCL, omit
9852                                                             lgkmcnt(0).
9853                                                           - Could be split into
9854                                                             separate s_waitcnt
9855                                                             vmcnt(0) and
9856                                                             s_waitcnt
9857                                                             lgkmcnt(0) to allow
9858                                                             them to be
9859                                                             independently moved
9860                                                             according to the
9861                                                             following rules.
9862                                                           - s_waitcnt vmcnt(0)
9863                                                             must happen after
9864                                                             any preceding
9865                                                             global/generic
9866                                                             load/store/load
9867                                                             atomic/store
9868                                                             atomic/atomicrmw.
9869                                                           - s_waitcnt lgkmcnt(0)
9870                                                             must happen after
9871                                                             any preceding
9872                                                             local/generic
9873                                                             load/store/load
9874                                                             atomic/store
9875                                                             atomic/atomicrmw.
9876                                                           - Must happen before
9877                                                             the following
9878                                                             atomicrmw.
9879                                                           - Ensures that all
9880                                                             memory operations
9881                                                             to global and local
9882                                                             have completed
9883                                                             before performing
9884                                                             the atomicrmw that
9885                                                             is being released.
9886
9887                                                         3. buffer/global/flat_atomic sc1=1
9888     atomicrmw    release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9889                                              - generic
9890                                                           - Must happen before
9891                                                             following s_waitcnt.
9892                                                           - Performs L2 writeback to
9893                                                             ensure previous
9894                                                             global/generic
9895                                                             store/atomicrmw are
9896                                                             visible at system scope.
9897
9898                                                         2. s_waitcnt lgkmcnt(0) &
9899                                                            vmcnt(0)
9900
9901                                                           - If TgSplit execution mode,
9902                                                             omit lgkmcnt(0).
9903                                                           - If OpenCL, omit
9904                                                             lgkmcnt(0).
9905                                                           - Could be split into
9906                                                             separate s_waitcnt
9907                                                             vmcnt(0) and
9908                                                             s_waitcnt
9909                                                             lgkmcnt(0) to allow
9910                                                             them to be
9911                                                             independently moved
9912                                                             according to the
9913                                                             following rules.
9914                                                           - s_waitcnt vmcnt(0)
9915                                                             must happen after
9916                                                             any preceding
9917                                                             global/generic
9918                                                             load/store/load
9919                                                             atomic/store
9920                                                             atomic/atomicrmw.
9921                                                           - s_waitcnt lgkmcnt(0)
9922                                                             must happen after
9923                                                             any preceding
9924                                                             local/generic
9925                                                             load/store/load
9926                                                             atomic/store
9927                                                             atomic/atomicrmw.
9928                                                           - Must happen before
9929                                                             the following
9930                                                             atomicrmw.
9931                                                           - Ensures that all
9932                                                             memory operations
9933                                                             and the L2
9934                                                             writeback have
9935                                                             completed before
9936                                                             performing the
9937                                                             atomicrmw that is
9938                                                             being released.
9939
9940                                                         3. buffer/global/flat_atomic
9941                                                            sc0=1 sc1=1
9942     fence        release      - singlethread *none*     *none*
9943                               - wavefront
9944     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9945
9946                                                           - Use lgkmcnt(0) if not
9947                                                             TgSplit execution mode
9948                                                             and vmcnt(0) if TgSplit
9949                                                             execution mode.
9950                                                           - If OpenCL and
9951                                                             address space is
9952                                                             not generic, omit
9953                                                             lgkmcnt(0).
9954                                                           - If OpenCL and
9955                                                             address space is
9956                                                             local, omit
9957                                                             vmcnt(0).
9958                                                           - However, since LLVM
9959                                                             fences currently have
9960                                                             no address space, the
9961                                                             s_waitcnt must
9962                                                             conservatively
9963                                                             always be generated. If
9964                                                             the fence had an
9965                                                             address space, it would
9966                                                             be set to the address
9967                                                             space of the OpenCL
9968                                                             fence flag, or to
9969                                                             generic if both
9970                                                             local and global
9971                                                             flags are
9972                                                             specified.
9973                                                           - s_waitcnt vmcnt(0)
9974                                                             must happen after
9975                                                             any preceding
9976                                                             global/generic
9977                                                             load/store/
9978                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     fence        release      - agent        *none*     1. buffer_wbl2 sc1=1

                                                           - If OpenCL and
                                                             address space is
                                                             local, omit.
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, this must
                                                             conservatively
                                                             always be generated.
                                                             If the fence had an
                                                             address space, then
                                                             set to the address
                                                             space of the OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, this must
                                                             conservatively
                                                             always be generated.
                                                             If the fence had an
                                                             address space, then
                                                             set to the address
                                                             space of the OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/store/
                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the
                                                             atomicrmw value
                                                             being acquired.

                                                         4. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/store/
                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         4. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         3. buffer/global_atomic
                                                         4. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             cache.

                                                         5. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global and L2 writeback
                                                             have completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         3. buffer/global_atomic
                                                            sc1=1
                                                         4. s_waitcnt vmcnt(0)

                                                           - Must happen before
10453                                                             following
10454                                                             buffer_inv.
10455                                                           - Ensures the
10456                                                             atomicrmw has
10457                                                             completed before
10458                                                             invalidating the
10459                                                             caches.
10460
10461                                                         5. buffer_inv sc0=1 sc1=1
10462
10463                                                           - Must happen before
10464                                                             any following
10465                                                             global/generic
10466                                                             load/load
10467                                                             atomic/atomicrmw.
10468                                                           - Ensures that
10469                                                             following loads
10470                                                             will not see stale
10471                                                             MTYPE NC global data.
10472                                                             MTYPE RW and CC memory will
10473                                                             never be stale due to the
10474                                                             memory probes.
10475
10476     atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1
10477
10478                                                           - Must happen before
10479                                                             following s_waitcnt.
10480                                                           - Performs L2 writeback to
10481                                                             ensure previous
10482                                                             global/generic
10483                                                             store/atomicrmw are
10484                                                             visible at agent scope.
10485
10486                                                         2. s_waitcnt lgkmcnt(0) &
10487                                                            vmcnt(0)
10488
10489                                                           - If TgSplit execution mode,
10490                                                             omit lgkmcnt(0).
10491                                                           - If OpenCL, omit
10492                                                             lgkmcnt(0).
10493                                                           - Could be split into
10494                                                             separate s_waitcnt
10495                                                             vmcnt(0) and
10496                                                             s_waitcnt
10497                                                             lgkmcnt(0) to allow
10498                                                             them to be
10499                                                             independently moved
10500                                                             according to the
10501                                                             following rules.
10502                                                           - s_waitcnt vmcnt(0)
10503                                                             must happen after
10504                                                             any preceding
10505                                                             global/generic
10506                                                             load/store/load
10507                                                             atomic/store
10508                                                             atomic/atomicrmw.
10509                                                           - s_waitcnt lgkmcnt(0)
10510                                                             must happen after
10511                                                             any preceding
10512                                                             local/generic
10513                                                             load/store/load
10514                                                             atomic/store
10515                                                             atomic/atomicrmw.
10516                                                           - Must happen before
10517                                                             the following
10518                                                             atomicrmw.
10519                                                           - Ensures that all
10520                                                             memory operations
10521                                                             to global have
10522                                                             completed before
10523                                                             performing the
10524                                                             atomicrmw that is
10525                                                             being released.
10526
10527                                                         3. flat_atomic
10528                                                         4. s_waitcnt vmcnt(0) &
10529                                                            lgkmcnt(0)
10530
10531                                                           - If TgSplit execution mode,
10532                                                             omit lgkmcnt(0).
10533                                                           - If OpenCL, omit
10534                                                             lgkmcnt(0).
10535                                                           - Must happen before
10536                                                             following
10537                                                             buffer_inv.
10538                                                           - Ensures the
10539                                                             atomicrmw has
10540                                                             completed before
10541                                                             invalidating the
10542                                                             cache.
10543
10544                                                         5. buffer_inv sc1=1
10545
10546                                                           - Must happen before
10547                                                             any following
10548                                                             global/generic
10549                                                             load/load
10550                                                             atomic/atomicrmw.
10551                                                           - Ensures that
10552                                                             following loads
10553                                                             will not see stale
10554                                                             global data.
10555
10556     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1
10557
10558                                                           - Must happen before
10559                                                             following s_waitcnt.
10560                                                           - Performs L2 writeback to
10561                                                             ensure previous
10562                                                             global/generic
10563                                                             store/atomicrmw are
10564                                                             visible at system scope.
10565
10566                                                         2. s_waitcnt lgkmcnt(0) &
10567                                                            vmcnt(0)
10568
10569                                                           - If TgSplit execution mode,
10570                                                             omit lgkmcnt(0).
10571                                                           - If OpenCL, omit
10572                                                             lgkmcnt(0).
10573                                                           - Could be split into
10574                                                             separate s_waitcnt
10575                                                             vmcnt(0) and
10576                                                             s_waitcnt
10577                                                             lgkmcnt(0) to allow
10578                                                             them to be
10579                                                             independently moved
10580                                                             according to the
10581                                                             following rules.
10582                                                           - s_waitcnt vmcnt(0)
10583                                                             must happen after
10584                                                             any preceding
10585                                                             global/generic
10586                                                             load/store/load
10587                                                             atomic/store
10588                                                             atomic/atomicrmw.
10589                                                           - s_waitcnt lgkmcnt(0)
10590                                                             must happen after
10591                                                             any preceding
10592                                                             local/generic
10593                                                             load/store/load
10594                                                             atomic/store
10595                                                             atomic/atomicrmw.
10596                                                           - Must happen before
10597                                                             the following
10598                                                             atomicrmw.
10599                                                           - Ensures that all
10600                                                             memory operations
10601                                                             to global and L2 writeback
10602                                                             have completed before
10603                                                             performing the
10604                                                             atomicrmw that is
10605                                                             being released.
10606
10607                                                         3. flat_atomic sc1=1
10608                                                         4. s_waitcnt vmcnt(0) &
10609                                                            lgkmcnt(0)
10610
10611                                                           - If TgSplit execution mode,
10612                                                             omit lgkmcnt(0).
10613                                                           - If OpenCL, omit
10614                                                             lgkmcnt(0).
10615                                                           - Must happen before
10616                                                             following
10617                                                             buffer_inv.
10618                                                           - Ensures the
10619                                                             atomicrmw has
10620                                                             completed before
10621                                                             invalidating the
10622                                                             caches.
10623
10624                                                         5. buffer_inv sc0=1 sc1=1
10625
10626                                                           - Must happen before
10627                                                             any following
10628                                                             global/generic
10629                                                             load/load
10630                                                             atomic/atomicrmw.
10631                                                           - Ensures that
10632                                                             following loads
10633                                                             will not see stale
10634                                                             MTYPE NC global data.
10635                                                             MTYPE RW and CC memory will
10636                                                             never be stale due to the
10637                                                             memory probes.
10638
10639     fence        acq_rel      - singlethread *none*     *none*
10640                               - wavefront
10641     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
10642
10643                                                           - Use lgkmcnt(0) if not
10644                                                             TgSplit execution mode
10645                                                             and vmcnt(0) if TgSplit
10646                                                             execution mode.
10647                                                           - If OpenCL and
10648                                                             address space is
10649                                                             not generic, omit
10650                                                             lgkmcnt(0).
10651                                                           - If OpenCL and
10652                                                             address space is
10653                                                             local, omit
10654                                                             vmcnt(0).
10655                                                           - However,
10656                                                             since LLVM
10657                                                             currently has no
10658                                                             address space on
10659                                                             the fence, the wait
10660                                                             must conservatively
10661                                                             always be generated
10662                                                             (see comment for
10663                                                             previous fence).
10664                                                           - s_waitcnt vmcnt(0)
10665                                                             must happen after
10666                                                             any preceding
10667                                                             global/generic
10668                                                             load/store/load
10669                                                             atomic/store
10670                                                             atomic/atomicrmw.
10671                                                           - s_waitcnt lgkmcnt(0)
10672                                                             must happen after
10673                                                             any preceding
10674                                                             local/generic
10675                                                             load/load
10676                                                             atomic/store/store
10677                                                             atomic/atomicrmw.
10678                                                           - Must happen before
10679                                                             any following
10680                                                             global/generic
10681                                                             load/load
10682                                                             atomic/store/store
10683                                                             atomic/atomicrmw.
10684                                                           - Ensures that all
10685                                                             memory operations
10686                                                             have
10687                                                             completed before
10688                                                             performing any
10689                                                             following global
10690                                                             memory operations.
10691                                                           - Ensures that the
10692                                                             preceding
10693                                                             local/generic load
10694                                                             atomic/atomicrmw
10695                                                             with an equal or
10696                                                             wider sync scope
10697                                                             and memory ordering
10698                                                             stronger than
10699                                                             unordered (this is
10700                                                             termed the
10701                                                             acquire-fence-paired-atomic)
10702                                                             has completed
10703                                                             before following
10704                                                             global memory
10705                                                             operations. This
10706                                                             satisfies the
10707                                                             requirements of
10708                                                             acquire.
10709                                                           - Ensures that all
10710                                                             previous memory
10711                                                             operations have
10712                                                             completed before a
10713                                                             following
10714                                                             local/generic store
10715                                                             atomic/atomicrmw
10716                                                             with an equal or
10717                                                             wider sync scope
10718                                                             and memory ordering
10719                                                             stronger than
10720                                                             unordered (this is
10721                                                             termed the
10722                                                             release-fence-paired-atomic).
10723                                                             This satisfies the
10724                                                             requirements of
10725                                                             release.
10726                                                           - Must happen before
10727                                                             the following
10728                                                             buffer_inv.
10729                                                           - Ensures that the
10730                                                             acquire-fence-paired-atomic
10731                                                             has completed
10732                                                             before invalidating
10733                                                             the
10734                                                             cache. Therefore
10735                                                             any following
10736                                                             locations read must
10737                                                             be no older than
10738                                                             the value read by
10739                                                             the
10740                                                             acquire-fence-paired-atomic.
10741
10742                                                         2. buffer_inv sc0=1
10743
10744                                                           - If not TgSplit execution
10745                                                             mode, omit.
10746                                                           - Ensures that
10747                                                             following
10748                                                             loads will not see
10749                                                             stale data.
10750
10751     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1
10752
10753                                                           - If OpenCL and
10754                                                             address space is
10755                                                             local, omit.
10756                                                           - Must happen before
10757                                                             following s_waitcnt.
10758                                                           - Performs L2 writeback to
10759                                                             ensure previous
10760                                                             global/generic
10761                                                             store/atomicrmw are
10762                                                             visible at agent scope.
10763
10764                                                         2. s_waitcnt lgkmcnt(0) &
10765                                                            vmcnt(0)
10766
10767                                                           - If TgSplit execution mode,
10768                                                             omit lgkmcnt(0).
10769                                                           - If OpenCL and
10770                                                             address space is
10771                                                             not generic, omit
10772                                                             lgkmcnt(0).
10773                                                           - However, since LLVM
10774                                                             currently has no
10775                                                             address space on
10776                                                             the fence, the wait
10777                                                             must conservatively
10778                                                             always be generated
10779                                                             (see comment for
10780                                                             previous fence).
10781                                                           - Could be split into
10782                                                             separate s_waitcnt
10783                                                             vmcnt(0) and
10784                                                             s_waitcnt
10785                                                             lgkmcnt(0) to allow
10786                                                             them to be
10787                                                             independently moved
10788                                                             according to the
10789                                                             following rules.
10790                                                           - s_waitcnt vmcnt(0)
10791                                                             must happen after
10792                                                             any preceding
10793                                                             global/generic
10794                                                             load/store/load
10795                                                             atomic/store
10796                                                             atomic/atomicrmw.
10797                                                           - s_waitcnt lgkmcnt(0)
10798                                                             must happen after
10799                                                             any preceding
10800                                                             local/generic
10801                                                             load/store/load
10802                                                             atomic/store
10803                                                             atomic/atomicrmw.
10804                                                           - Must happen before
10805                                                             the following
10806                                                             buffer_inv.
10807                                                           - Ensures that the
10808                                                             preceding
10809                                                             global/local/generic
10810                                                             load
10811                                                             atomic/atomicrmw
10812                                                             with an equal or
10813                                                             wider sync scope
10814                                                             and memory ordering
10815                                                             stronger than
10816                                                             unordered (this is
10817                                                             termed the
10818                                                             acquire-fence-paired-atomic)
10819                                                             has completed
10820                                                             before invalidating
10821                                                             the cache. This
10822                                                             satisfies the
10823                                                             requirements of
10824                                                             acquire.
10825                                                           - Ensures that all
10826                                                             previous memory
10827                                                             operations have
10828                                                             completed before a
10829                                                             following
10830                                                             global/local/generic
10831                                                             store
10832                                                             atomic/atomicrmw
10833                                                             with an equal or
10834                                                             wider sync scope
10835                                                             and memory ordering
10836                                                             stronger than
10837                                                             unordered (this is
10838                                                             termed the
10839                                                             release-fence-paired-atomic).
10840                                                             This satisfies the
10841                                                             requirements of
10842                                                             release.
10843
10844                                                         3. buffer_inv sc1=1
10845
10846                                                           - Must happen before
10847                                                             any following
10848                                                             global/generic
10849                                                             load/load
10850                                                             atomic/store/store
10851                                                             atomic/atomicrmw.
10852                                                           - Ensures that
10853                                                             following loads
10854                                                             will not see stale
10855                                                             global data. This
10856                                                             satisfies the
10857                                                             requirements of
10858                                                             acquire.
10859
10860     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10861
10862                                                           - If OpenCL and
10863                                                             address space is
10864                                                             local, omit.
10865                                                           - Must happen before
10866                                                             following s_waitcnt.
10867                                                           - Performs L2 writeback to
10868                                                             ensure previous
10869                                                             global/generic
10870                                                             store/atomicrmw are
10871                                                             visible at system scope.
10872
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures that the
                                                             preceding
                                                             global/local/generic
                                                             load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed
                                                             before invalidating
                                                             the cache. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have
                                                             completed before a
                                                             following
                                                             global/local/generic
                                                             store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of
                                                             release.

                                                         3. buffer_inv sc0=1 sc1=1
10954
10955                                                           - Must happen before
10956                                                             any following
10957                                                             global/generic
10958                                                             load/load
10959                                                             atomic/store/store
10960                                                             atomic/atomicrmw.
10961                                                           - Ensures that
10962                                                             following loads
10963                                                             will not see stale
10964                                                             MTYPE NC global data.
10965                                                             MTYPE RW and CC memory will
10966                                                             never be stale due to the
10967                                                             memory probes.
10968
10969     **Sequential Consistent Atomic**
10970     ------------------------------------------------------------------------------------
10971     load atomic  seq_cst      - singlethread - global   *Same as corresponding
10972                               - wavefront    - local    load atomic acquire,
10973                                              - generic  except must generate
10974                                                         all instructions even
10975                                                         for OpenCL.*
10976     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10977                                              - generic
10978                                                           - Use lgkmcnt(0) if not
10979                                                             TgSplit execution mode
10980                                                             and vmcnt(0) if TgSplit
10981                                                             execution mode.
10982                                                           - s_waitcnt lgkmcnt(0) must
10983                                                             happen after
10984                                                             preceding
10985                                                             local/generic load
10986                                                             atomic/store
10987                                                             atomic/atomicrmw
10988                                                             with memory
10989                                                             ordering of seq_cst
10990                                                             and with equal or
10991                                                             wider sync scope.
10992                                                             (Note that seq_cst
10993                                                             fences have their
10994                                                             own s_waitcnt
10995                                                             lgkmcnt(0) and so do
10996                                                             not need to be
10997                                                             considered.)
10998                                                           - s_waitcnt vmcnt(0)
10999                                                             must happen after
11000                                                             preceding
11001                                                             global/generic load
11002                                                             atomic/store
11003                                                             atomic/atomicrmw
11004                                                             with memory
11005                                                             ordering of seq_cst
11006                                                             and with equal or
11007                                                             wider sync scope.
11008                                                             (Note that seq_cst
11009                                                             fences have their
11010                                                             own s_waitcnt
11011                                                             vmcnt(0) and so do
11012                                                             not need to be
11013                                                             considered.)
11014                                                           - Ensures any
11015                                                             preceding
                                                             sequentially
11017                                                             consistent global/local
11018                                                             memory instructions
11019                                                             have completed
11020                                                             before executing
11021                                                             this sequentially
11022                                                             consistent
11023                                                             instruction. This
11024                                                             prevents reordering
11025                                                             a seq_cst store
11026                                                             followed by a
11027                                                             seq_cst load. (Note
11028                                                             that seq_cst is
11029                                                             stronger than
11030                                                             acquire/release as
11031                                                             the reordering of
11032                                                             load acquire
11033                                                             followed by a store
11034                                                             release is
11035                                                             prevented by the
11036                                                             s_waitcnt of
11037                                                             the release, but
11038                                                             there is nothing
11039                                                             preventing a store
11040                                                             release followed by
11041                                                             load acquire from
11042                                                             completing out of
11043                                                             order. The s_waitcnt
11044                                                             could be placed after
                                                             the seq_cst store or before
                                                             the seq_cst load. We
11047                                                             choose the load to
11048                                                             make the s_waitcnt be
11049                                                             as late as possible
11050                                                             so that the store
11051                                                             may have already
11052                                                             completed.)
11053
11054                                                         2. *Following
11055                                                            instructions same as
11056                                                            corresponding load
11057                                                            atomic acquire,
11058                                                            except must generate
11059                                                            all instructions even
11060                                                            for OpenCL.*
11061     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
11062                                                         local address space cannot
11063                                                         be used.*
11064
11065                                                         *Same as corresponding
11066                                                         load atomic acquire,
11067                                                         except must generate
11068                                                         all instructions even
11069                                                         for OpenCL.*
11070
11071     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
11072                               - system       - generic     vmcnt(0)
11073
11074                                                           - If TgSplit execution mode,
11075                                                             omit lgkmcnt(0).
11076                                                           - Could be split into
11077                                                             separate s_waitcnt
11078                                                             vmcnt(0)
11079                                                             and s_waitcnt
11080                                                             lgkmcnt(0) to allow
11081                                                             them to be
11082                                                             independently moved
11083                                                             according to the
11084                                                             following rules.
11085                                                           - s_waitcnt lgkmcnt(0)
11086                                                             must happen after
11087                                                             preceding
                                                             local/generic load
11089                                                             atomic/store
11090                                                             atomic/atomicrmw
11091                                                             with memory
11092                                                             ordering of seq_cst
11093                                                             and with equal or
11094                                                             wider sync scope.
11095                                                             (Note that seq_cst
11096                                                             fences have their
11097                                                             own s_waitcnt
11098                                                             lgkmcnt(0) and so do
11099                                                             not need to be
11100                                                             considered.)
11101                                                           - s_waitcnt vmcnt(0)
11102                                                             must happen after
11103                                                             preceding
11104                                                             global/generic load
11105                                                             atomic/store
11106                                                             atomic/atomicrmw
11107                                                             with memory
11108                                                             ordering of seq_cst
11109                                                             and with equal or
11110                                                             wider sync scope.
11111                                                             (Note that seq_cst
11112                                                             fences have their
11113                                                             own s_waitcnt
11114                                                             vmcnt(0) and so do
11115                                                             not need to be
11116                                                             considered.)
11117                                                           - Ensures any
11118                                                             preceding
                                                             sequentially
11120                                                             consistent global
11121                                                             memory instructions
11122                                                             have completed
11123                                                             before executing
11124                                                             this sequentially
11125                                                             consistent
11126                                                             instruction. This
11127                                                             prevents reordering
11128                                                             a seq_cst store
11129                                                             followed by a
11130                                                             seq_cst load. (Note
11131                                                             that seq_cst is
11132                                                             stronger than
11133                                                             acquire/release as
11134                                                             the reordering of
11135                                                             load acquire
11136                                                             followed by a store
11137                                                             release is
11138                                                             prevented by the
11139                                                             s_waitcnt of
11140                                                             the release, but
11141                                                             there is nothing
11142                                                             preventing a store
11143                                                             release followed by
11144                                                             load acquire from
11145                                                             completing out of
11146                                                             order. The s_waitcnt
11147                                                             could be placed after
                                                             the seq_cst store or before
                                                             the seq_cst load. We
11150                                                             choose the load to
11151                                                             make the s_waitcnt be
11152                                                             as late as possible
11153                                                             so that the store
11154                                                             may have already
11155                                                             completed.)
11156
11157                                                         2. *Following
11158                                                            instructions same as
11159                                                            corresponding load
11160                                                            atomic acquire,
11161                                                            except must generate
11162                                                            all instructions even
11163                                                            for OpenCL.*
11164     store atomic seq_cst      - singlethread - global   *Same as corresponding
11165                               - wavefront    - local    store atomic release,
11166                               - workgroup    - generic  except must generate
11167                               - agent                   all instructions even
11168                               - system                  for OpenCL.*
11169     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
11170                               - wavefront    - local    atomicrmw acq_rel,
11171                               - workgroup    - generic  except must generate
11172                               - agent                   all instructions even
11173                               - system                  for OpenCL.*
11174     fence        seq_cst      - singlethread *none*     *Same as corresponding
11175                               - wavefront               fence acq_rel,
11176                               - workgroup               except must generate
11177                               - agent                   all instructions even
11178                               - system                  for OpenCL.*
11179     ============ ============ ============== ========== ================================
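
As an informal illustration of the seq_cst ``load atomic`` rows above for the
global and generic address spaces, the following Python sketch computes which
counters the leading ``s_waitcnt`` must wait on. The function names and
parameters are hypothetical and are not part of LLVM. The sketch only restates
the table and omits the instructions that follow from the corresponding load
atomic acquire sequence.

.. code-block:: python

   def seq_cst_load_waitcnt(sync_scope, tgsplit):
       """Return the counters that must reach zero before a seq_cst
       load atomic in the global or generic address space."""
       if sync_scope in ("singlethread", "wavefront"):
           # Same as load atomic acquire: no leading s_waitcnt is listed.
           return []
       if sync_scope == "workgroup":
           # lgkmcnt(0) if not TgSplit execution mode, vmcnt(0) if TgSplit.
           return ["vmcnt"] if tgsplit else ["lgkmcnt"]
       if sync_scope in ("agent", "system"):
           # lgkmcnt(0) & vmcnt(0); omit lgkmcnt(0) in TgSplit execution mode.
           return ["vmcnt"] if tgsplit else ["lgkmcnt", "vmcnt"]
       raise ValueError(f"unknown sync scope: {sync_scope!r}")


   def format_waitcnt(counters):
       """Render one s_waitcnt per counter (the table allows splitting)."""
       return [f"s_waitcnt {name}(0)" for name in counters]

For example, ``format_waitcnt(seq_cst_load_waitcnt("agent", False))`` yields a
separate ``s_waitcnt`` for ``lgkmcnt`` and for ``vmcnt``, which matches the
split form the table permits.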
11180
11181.. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11182
11183Memory Model GFX10-GFX11
11184++++++++++++++++++++++++
11185
11186For GFX10-GFX11:
11187
11188* Each agent has multiple shader arrays (SA).
11189* Each SA has multiple work-group processors (WGP).
11190* Each WGP has multiple compute units (CU).
11191* Each CU has multiple SIMDs that execute wavefronts.
11192* The wavefronts for a single work-group are executed in the same
11193  WGP. In CU wavefront execution mode the wavefronts may be executed by
11194  different SIMDs in the same CU. In WGP wavefront execution mode the
11195  wavefronts may be executed by different SIMDs in different CUs in the same
11196  WGP.
11197* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
11198  executing on it.
11199* All LDS operations of a WGP are performed as wavefront wide operations in a
11200  global order and involve no caching. Completion is reported to a wavefront in
11201  execution order.
11202* The LDS memory has multiple request queues shared by the SIMDs of a
11203  WGP. Therefore, the LDS operations performed by different wavefronts of a
11204  work-group can be reordered relative to each other, which can result in
11205  reordering the visibility of vector memory operations with respect to LDS
11206  operations of other wavefronts in the same work-group. A ``s_waitcnt
11207  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11208  vector memory operations between wavefronts of a work-group, but not between
11209  operations performed by the same wavefront.
11210* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
11212  execution order of other load/store/sample operations performed by that
11213  wavefront.
11214* The vector memory operations access a vector L0 cache. There is a single L0
11215  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11216  special action is required for coherence between the lanes of a single
11217  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11218  wavefronts executing in the same work-group as they may be executing on SIMDs
11219  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11220  required for coherence between wavefronts executing in different work-groups
11221  as they may be executing on different WGPs.
11222* The scalar memory operations access a scalar L0 cache shared by all wavefronts
11223  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way and so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
11226* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11227  the same SA. Therefore, no special action is required for coherence between
11228  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11229  required for coherence between wavefronts executing in different work-groups
11230  as they may be executing on different SAs that access different L1s.
11231* The L1 caches have independent quadrants to service disjoint ranges of virtual
11232  addresses.
11233* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11234  vector and scalar memory operations performed by different wavefronts, whether
11235  executing in the same or different work-groups (which may be executing on
11236  different CUs accessing different L0s), can be reordered relative to each
11237  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11238  synchronization between vector memory operations of different wavefronts. It
11239  ensures a previous vector memory operation has completed before executing a
11240  subsequent vector memory or LDS operation and so can be used to meet the
11241  requirements of acquire, release and sequential consistency.
11242* The L1 caches use an L2 cache shared by all SAs on the same agent.
11243* The L2 cache has independent channels to service disjoint ranges of virtual
11244  addresses.
11245* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11246  quadrant has a separate request queue per L2 channel. Therefore, the vector
11247  and scalar memory operations performed by wavefronts executing in different
11248  work-groups (which may be executing on different SAs) of an agent can be
11249  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11250  required to ensure synchronization between vector memory operations of
11251  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
11254* The L2 cache can be kept coherent with other agents on some targets, or ranges
11255  of virtual addresses can be set up to bypass it to ensure system coherence.
11256* On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
11257  The MALL cache is fully coherent with GPU memory and has no impact on system
11258  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
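
The hierarchy in the list above can be summarized by asking which cache level
two wavefronts on the same agent have in common for vector memory accesses. The
following is only a minimal sketch: ``nearest_shared_cache`` and its
``same_cu``/``same_sa`` flags are hypothetical names introduced for
illustration and are not part of LLVM or of any hardware interface.

```python
def nearest_shared_cache(same_cu: bool, same_sa: bool) -> str:
    """Closest cache level that two wavefronts on one agent both go through.

    Hypothetical helper for illustration only. It models vector memory
    accesses as described in the list above.
    """
    if same_cu:
        return "L0"  # wavefronts on one CU access that CU's L0
    if same_sa:
        return "L1"  # different CUs in one SA: different L0s, same L1
    return "L2"      # different SAs: different L1s, shared L2 of the agent


if __name__ == "__main__":
    print(nearest_shared_cache(same_cu=False, same_sa=True))
```

Any level below the returned one is private to each wavefront's CU or SA,
which is why operations from the two wavefronts can be reordered relative to
each other unless the appropriate ``s_waitcnt`` is used.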
11259
11260Scalar memory operations are only used to access memory that is proven to not
11261change during the execution of the kernel dispatch. This includes constant
11262address space and global address space for program scope ``const`` variables.
11263Therefore, the kernel machine code does not have to maintain the scalar cache to
11264ensure it is coherent with the vector caches. The scalar and vector caches are
11265invalidated between kernel dispatches by CP since constant address space data
11266may change between kernel dispatch executions. See
11267:ref:`amdgpu-amdhsa-memory-spaces`.
11268
11269The one exception is if scalar writes are used to spill SGPR registers. In this
11270case the AMDGPU backend ensures the memory location used to spill is never
11271accessed by vector memory operations at the same time. If scalar writes are used
11272then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11273return since the locations may be used for vector memory instructions by a
11274future wavefront that uses the same scratch area, or a function call that
11275creates a frame at the same address, respectively. There is no need for a
11276``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
11277
11278For kernarg backing memory:
11279
11280* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11281* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11282  needing to invalidate the L2 cache.
11283* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11284  so the L2 cache will be coherent with the CPU and other agents.
11285
11286Scratch backing memory (which is used for the private address space) is accessed
11287with MTYPE NC (non-coherent). Since the private address space is only accessed
11288by a single thread, and is always write-before-read, there is never a need to
11289invalidate these entries from the L0 or L1 caches.
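
The MTYPE choices described above for kernarg and scratch backing memory can be
written as a small lookup. This is a sketch under stated assumptions:
``backing_memory_mtype``, the ``kind`` strings and the ``is_apu`` flag are
hypothetical names used only for illustration, not a runtime or driver API.

```python
def backing_memory_mtype(kind: str, is_apu: bool = False) -> str:
    """Return the MTYPE used for a kind of backing memory.

    Hypothetical helper: `kind` is "kernarg" or "scratch"; `is_apu`
    distinguishes an APU from a dGPU and only matters for kernarg memory.
    """
    if kind == "scratch":
        return "NC"  # non-coherent: private, always write-before-read
    if kind == "kernarg":
        # APU: cache coherent with the CPU and other agents.
        # dGPU: uncached, so the L2 never needs to be invalidated.
        return "CC" if is_apu else "UC"
    raise ValueError(f"unknown backing memory kind: {kind!r}")


if __name__ == "__main__":
    print(backing_memory_mtype("kernarg", is_apu=False))
```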
11290
11291Wavefronts are executed in native mode with in-order reporting of loads and
11292sample instructions. In this mode vmcnt reports completion of load, atomic with
11293return and sample instructions in order, and the vscnt reports the completion of
11294store and atomic without return in order. See the ``MEM_ORDERED`` field in
11295:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
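
As a rough illustration of the paragraph above, the mapping from an operation
kind to the counter that reports its completion can be sketched as follows. The
function name ``completion_counter`` and the operation-kind strings are
hypothetical labels chosen for this example; only the ``vmcnt`` and ``vscnt``
counter names come from the text.

```python
# Operation kinds whose completion is reported in order by each counter.
# The string labels are illustrative, not instruction mnemonics.
VMCNT_OPS = {"load", "atomic_with_return", "sample"}
VSCNT_OPS = {"store", "atomic_without_return"}


def completion_counter(op: str) -> str:
    """Return the counter that tracks completion of a vector memory op."""
    if op in VMCNT_OPS:
        return "vmcnt"
    if op in VSCNT_OPS:
        return "vscnt"
    raise ValueError(f"not covered by this sketch: {op!r}")


if __name__ == "__main__":
    for op in ("load", "store", "atomic_with_return", "atomic_without_return"):
        print(op, "->", completion_counter(op))
```

This is why the code sequences wait on ``vmcnt(0)`` after an atomic with
return but on ``vscnt(0)`` after an atomic without return.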
11296
11297Wavefronts can be executed in WGP or CU wavefront execution mode:
11298
11299* In WGP wavefront execution mode the wavefronts of a work-group are executed
11300  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
11301  CU L0 caches is required for work-group synchronization. Also, accesses to L1
11302  at work-group scope need to be explicitly ordered as the accesses from
11303  different CUs are not ordered.
11304* In CU wavefront execution mode the wavefronts of a work-group are executed on
11305  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
11306  the work-group access the same L0 which in turn ensures L1 accesses are
11307  ordered and so do not require explicit management of the caches for
11308  work-group synchronization.
11309
11310See the ``WGP_MODE`` field in
11311:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
11312:ref:`amdgpu-target-features`.
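
To make the effect of the execution mode concrete, the sketch below selects the
steps for a work-group scope acquire ``load atomic`` from the global address
space, using the notation of the GFX10-GFX11 code sequences table. It is an
illustration only: ``workgroup_acquire_global_load`` and its ``cu_mode`` flag
are hypothetical names, and the strings are table notation rather than
assembler syntax.

```python
from typing import List


def workgroup_acquire_global_load(cu_mode: bool) -> List[str]:
    """Steps for a workgroup-scope acquire load from global memory.

    Hypothetical helper. `cu_mode` is True for CU wavefront execution mode
    and False for WGP wavefront execution mode.
    """
    if cu_mode:
        # All wavefronts of the work-group share one L0, so no cache
        # management is needed: glc=1, the wait and the invalidate are omitted.
        return ["buffer/global_load"]
    # WGP mode: the work-group may span both CUs, each with its own L0.
    return [
        "buffer/global_load glc=1",
        "s_waitcnt vmcnt(0)",   # load must complete before the invalidate
        "buffer_gl0_inv",       # following loads will not see stale data
    ]


if __name__ == "__main__":
    for mode in (True, False):
        print("CU mode" if mode else "WGP mode", workgroup_acquire_global_load(mode))
```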
11313
11314The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11315table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
11316
11317  .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11318     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11319
11320     ============ ============ ============== ========== ================================
11321     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
11322                  Ordering     Sync Scope     Address    GFX10-GFX11
11323                                              Space
11324     ============ ============ ============== ========== ================================
11325     **Non-Atomic**
11326     ------------------------------------------------------------------------------------
11327     load         *none*       *none*         - global   - !volatile & !nontemporal
11328                                              - generic
11329                                              - private    1. buffer/global/flat_load
11330                                              - constant
11331                                                         - !volatile & nontemporal
11332
11333                                                           1. buffer/global/flat_load
11334                                                              slc=1 dlc=1
11335
11336                                                            - If GFX10, omit dlc=1.
11337
11338                                                         - volatile
11339
11340                                                           1. buffer/global/flat_load
11341                                                              glc=1 dlc=1
11342
11343                                                           2. s_waitcnt vmcnt(0)
11344
11345                                                            - Must happen before
11346                                                              any following volatile
11347                                                              global/generic
11348                                                              load/store.
11349                                                            - Ensures that
11350                                                              volatile
11351                                                              operations to
11352                                                              different
11353                                                              addresses will not
11354                                                              be reordered by
11355                                                              hardware.
11356
11357     load         *none*       *none*         - local    1. ds_load
11358     store        *none*       *none*         - global   - !volatile & !nontemporal
11359                                              - generic
11360                                              - private    1. buffer/global/flat_store
11361                                              - constant
11362                                                         - !volatile & nontemporal
11363
11364                                                           1. buffer/global/flat_store
11365                                                              glc=1 slc=1 dlc=1
11366
11367                                                            - If GFX10, omit dlc=1.
11368
11369                                                         - volatile
11370
11371                                                           1. buffer/global/flat_store
11372                                                              dlc=1
11373
11374                                                            - If GFX10, omit dlc=1.
11375
11376                                                           2. s_waitcnt vscnt(0)
11377
11378                                                            - Must happen before
11379                                                              any following volatile
11380                                                              global/generic
11381                                                              load/store.
11382                                                            - Ensures that
11383                                                              volatile
11384                                                              operations to
11385                                                              different
11386                                                              addresses will not
11387                                                              be reordered by
11388                                                              hardware.
11389
11390     store        *none*       *none*         - local    1. ds_store
11391     **Unordered Atomic**
11392     ------------------------------------------------------------------------------------
11393     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
11394     store atomic unordered    *any*          *any*      *Same as non-atomic*.
11395     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
11396     **Monotonic Atomic**
11397     ------------------------------------------------------------------------------------
11398     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
11399                               - wavefront    - generic
11400     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
11401                                              - generic     glc=1
11402
11403                                                           - If CU wavefront execution
11404                                                             mode, omit glc=1.
11405
11406     load atomic  monotonic    - singlethread - local    1. ds_load
11407                               - wavefront
11408                               - workgroup
11409     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
11410                               - system       - generic     glc=1 dlc=1
11411
11412                                                           - If GFX11, omit dlc=1.
11413
11414     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
11415                               - wavefront    - generic
11416                               - workgroup
11417                               - agent
11418                               - system
11419     store atomic monotonic    - singlethread - local    1. ds_store
11420                               - wavefront
11421                               - workgroup
11422     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
11423                               - wavefront    - generic
11424                               - workgroup
11425                               - agent
11426                               - system
11427     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
11428                               - wavefront
11429                               - workgroup
11430     **Acquire Atomic**
11431     ------------------------------------------------------------------------------------
11432     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
11433                               - wavefront    - local
11434                                              - generic
11435     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
11436
11437                                                           - If CU wavefront execution
11438                                                             mode, omit glc=1.
11439
11440                                                         2. s_waitcnt vmcnt(0)
11441
11442                                                           - If CU wavefront execution
11443                                                             mode, omit.
11444                                                           - Must happen before
11445                                                             the following buffer_gl0_inv
11446                                                             and before any following
11447                                                             global/generic
11448                                                             load/load
11449                                                             atomic/store/store
11450                                                             atomic/atomicrmw.
11451
11452                                                         3. buffer_gl0_inv
11453
11454                                                           - If CU wavefront execution
11455                                                             mode, omit.
11456                                                           - Ensures that
11457                                                             following
11458                                                             loads will not see
11459                                                             stale data.
11460
11461     load atomic  acquire      - workgroup    - local    1. ds_load
11462                                                         2. s_waitcnt lgkmcnt(0)
11463
11464                                                           - If OpenCL, omit.
11465                                                           - Must happen before
11466                                                             the following buffer_gl0_inv
11467                                                             and before any following
11468                                                             global/generic load/load
11469                                                             atomic/store/store
11470                                                             atomic/atomicrmw.
11471                                                           - Ensures any
11472                                                             following global
11473                                                             data read is no
11474                                                             older than the local load
11475                                                             atomic value being
11476                                                             acquired.
11477
11478                                                         3. buffer_gl0_inv
11479
11480                                                           - If CU wavefront execution
11481                                                             mode, omit.
11482                                                           - If OpenCL, omit.
11483                                                           - Ensures that
11484                                                             following
11485                                                             loads will not see
11486                                                             stale data.
11487
11488     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
11489
11490                                                           - If CU wavefront execution
11491                                                             mode, omit glc=1.
11492
11493                                                         2. s_waitcnt lgkmcnt(0) &
11494                                                            vmcnt(0)
11495
11496                                                           - If CU wavefront execution
11497                                                             mode, omit vmcnt(0).
11498                                                           - If OpenCL, omit
11499                                                             lgkmcnt(0).
11500                                                           - Must happen before
11501                                                             the following
11502                                                             buffer_gl0_inv and any
11503                                                             following global/generic
11504                                                             load/load
11505                                                             atomic/store/store
11506                                                             atomic/atomicrmw.
11507                                                           - Ensures any
11508                                                             following global
11509                                                             data read is no
11510                                                             older than a local load
11511                                                             atomic value being
11512                                                             acquired.
11513
11514                                                         3. buffer_gl0_inv
11515
11516                                                           - If CU wavefront execution
11517                                                             mode, omit.
11518                                                           - Ensures that
11519                                                             following
11520                                                             loads will not see
11521                                                             stale data.
11522
11523     load atomic  acquire      - agent        - global   1. buffer/global_load
11524                               - system                     glc=1 dlc=1
11525
11526                                                           - If GFX11, omit dlc=1.
11527
11528                                                         2. s_waitcnt vmcnt(0)
11529
11530                                                           - Must happen before
11531                                                             following
11532                                                             buffer_gl*_inv.
11533                                                           - Ensures the load
11534                                                             has completed
11535                                                             before invalidating
11536                                                             the caches.
11537
11538                                                         3. buffer_gl0_inv;
11539                                                            buffer_gl1_inv
11540
11541                                                           - Must happen before
11542                                                             any following
11543                                                             global/generic
11544                                                             load/load
11545                                                             atomic/atomicrmw.
11546                                                           - Ensures that
11547                                                             following
11548                                                             loads will not see
11549                                                             stale global data.
11550
11551     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
11552                               - system
11553                                                           - If GFX11, omit dlc=1.
11554
11555                                                         2. s_waitcnt vmcnt(0) &
11556                                                            lgkmcnt(0)
11557
11558                                                           - If OpenCL, omit
11559                                                             lgkmcnt(0).
11560                                                           - Must happen before
11561                                                             following
11562                                                             buffer_gl*_inv.
11563                                                           - Ensures the flat_load
11564                                                             has completed
11565                                                             before invalidating
11566                                                             the caches.
11567
11568                                                         3. buffer_gl0_inv;
11569                                                            buffer_gl1_inv
11570
11571                                                           - Must happen before
11572                                                             any following
11573                                                             global/generic
11574                                                             load/load
11575                                                             atomic/atomicrmw.
11576                                                           - Ensures that
11577                                                             following loads
11578                                                             will not see stale
11579                                                             global data.
11580
11581     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
11582                               - wavefront    - local
11583                                              - generic
11584     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
11585                                                         2. s_waitcnt vm/vscnt(0)
11586
11587                                                           - If CU wavefront execution
11588                                                             mode, omit.
11589                                                           - Use vmcnt(0) if atomic with
11590                                                             return and vscnt(0) if
11591                                                             atomic with no-return.
11592                                                           - Must happen before
11593                                                             the following buffer_gl0_inv
11594                                                             and before any following
11595                                                             global/generic
11596                                                             load/load
11597                                                             atomic/store/store
11598                                                             atomic/atomicrmw.
11599
11600                                                         3. buffer_gl0_inv
11601
11602                                                           - If CU wavefront execution
11603                                                             mode, omit.
11604                                                           - Ensures that
11605                                                             following
11606                                                             loads will not see
11607                                                             stale data.
11608
11609     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
11610                                                         2. s_waitcnt lgkmcnt(0)
11611
11612                                                           - If OpenCL, omit.
11613                                                           - Must happen before
11614                                                             the following
11615                                                             buffer_gl0_inv.
11616                                                           - Ensures any
11617                                                             following global
11618                                                             data read is no
11619                                                             older than the local
11620                                                             atomicrmw value
11621                                                             being acquired.
11622
11623                                                         3. buffer_gl0_inv
11624
11625                                                           - If OpenCL, omit.
11626                                                           - Ensures that
11627                                                             following
11628                                                             loads will not see
11629                                                             stale data.
11630
11631     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
11632                                                         2. s_waitcnt lgkmcnt(0) &
11633                                                            vm/vscnt(0)
11634
11635                                                           - If CU wavefront execution
11636                                                             mode, omit vm/vscnt(0).
11637                                                           - If OpenCL, omit lgkmcnt(0).
11638                                                           - Use vmcnt(0) if atomic with
11639                                                             return and vscnt(0) if
11640                                                             atomic with no-return.
11641                                                           - Must happen before
11642                                                             the following
11643                                                             buffer_gl0_inv.
11644                                                           - Ensures any
11645                                                             following global
11646                                                             data read is no
11647                                                             older than a local
11648                                                             atomicrmw value
11649                                                             being acquired.
11650
11651                                                         3. buffer_gl0_inv
11652
11653                                                           - If CU wavefront execution
11654                                                             mode, omit.
11655                                                           - Ensures that
11656                                                             following
11657                                                             loads will not see
11658                                                             stale data.
11659
11660     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
11661                               - system                  2. s_waitcnt vm/vscnt(0)
11662
11663                                                           - Use vmcnt(0) if atomic with
11664                                                             return and vscnt(0) if
11665                                                             atomic with no-return.
11666                                                           - Must happen before
11667                                                             following
11668                                                             buffer_gl*_inv.
11669                                                           - Ensures the
11670                                                             atomicrmw has
11671                                                             completed before
11672                                                             invalidating the
11673                                                             caches.
11674
11675                                                         3. buffer_gl0_inv;
11676                                                            buffer_gl1_inv
11677
11678                                                           - Must happen before
11679                                                             any following
11680                                                             global/generic
11681                                                             load/load
11682                                                             atomic/atomicrmw.
11683                                                           - Ensures that
11684                                                             following loads
11685                                                             will not see stale
11686                                                             global data.
11687
11688     atomicrmw    acquire      - agent        - generic  1. flat_atomic
11689                               - system                  2. s_waitcnt vm/vscnt(0) &
11690                                                            lgkmcnt(0)
11691
11692                                                           - If OpenCL, omit
11693                                                             lgkmcnt(0).
11694                                                           - Use vmcnt(0) if atomic with
11695                                                             return and vscnt(0) if
11696                                                             atomic with no-return.
11697                                                           - Must happen before
11698                                                             following
11699                                                             buffer_gl*_inv.
11700                                                           - Ensures the
11701                                                             atomicrmw has
11702                                                             completed before
11703                                                             invalidating the
11704                                                             caches.
11705
11706                                                         3. buffer_gl0_inv;
11707                                                            buffer_gl1_inv
11708
11709                                                           - Must happen before
11710                                                             any following
11711                                                             global/generic
11712                                                             load/load
11713                                                             atomic/atomicrmw.
11714                                                           - Ensures that
11715                                                             following loads
11716                                                             will not see stale
11717                                                             global data.
11718
11719     fence        acquire      - singlethread *none*     *none*
11720                               - wavefront
11721     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
11722                                                            vmcnt(0) & vscnt(0)
11723
11724                                                           - If CU wavefront execution
11725                                                             mode, omit vmcnt(0) and
11726                                                             vscnt(0).
11727                                                           - If OpenCL and
11728                                                             address space is
11729                                                             not generic, omit
11730                                                             lgkmcnt(0).
11731                                                           - If OpenCL and
11732                                                             address space is
11733                                                             local, omit
11734                                                             vmcnt(0) and vscnt(0).
11735                                                           - However, since LLVM
11736                                                             currently has no
11737                                                             address space on
11738                                                             the fence, the wait
11739                                                             must conservatively
11740                                                             always be generated.
11741                                                             If the fence had an
11742                                                             address space, it
11743                                                             would be set to the
11744                                                             address space of the
11745                                                             OpenCL fence flag, or
11746                                                             to generic if both
11747                                                             local and global
11748                                                             flags are
11749                                                             specified.
11750                                                           - Could be split into
11751                                                             separate s_waitcnt
11752                                                             vmcnt(0), s_waitcnt
11753                                                             vscnt(0) and s_waitcnt
11754                                                             lgkmcnt(0) to allow
11755                                                             them to be
11756                                                             independently moved
11757                                                             according to the
11758                                                             following rules.
11759                                                           - s_waitcnt vmcnt(0)
11760                                                             must happen after
11761                                                             any preceding
11762                                                             global/generic load
11763                                                             atomic/
11764                                                             atomicrmw-with-return-value
11765                                                             with an equal or
11766                                                             wider sync scope
11767                                                             and memory ordering
11768                                                             stronger than
11769                                                             unordered (this is
11770                                                             termed the
11771                                                             fence-paired-atomic).
11772                                                           - s_waitcnt vscnt(0)
11773                                                             must happen after
11774                                                             any preceding
11775                                                             global/generic
11776                                                             atomicrmw-no-return-value
11777                                                             with an equal or
11778                                                             wider sync scope
11779                                                             and memory ordering
11780                                                             stronger than
11781                                                             unordered (this is
11782                                                             termed the
11783                                                             fence-paired-atomic).
11784                                                           - s_waitcnt lgkmcnt(0)
11785                                                             must happen after
11786                                                             any preceding
11787                                                             local/generic load
11788                                                             atomic/atomicrmw
11789                                                             with an equal or
11790                                                             wider sync scope
11791                                                             and memory ordering
11792                                                             stronger than
11793                                                             unordered (this is
11794                                                             termed the
11795                                                             fence-paired-atomic).
11796                                                           - Must happen before
11797                                                             the following
11798                                                             buffer_gl0_inv.
11799                                                           - Ensures that the
11800                                                             fence-paired atomic
11801                                                             has completed
11802                                                             before invalidating
11803                                                             the
11804                                                             cache. Therefore
11805                                                             any following
11806                                                             locations read must
11807                                                             be no older than
11808                                                             the value read by
11809                                                             the
11810                                                             fence-paired-atomic.
11811
11812                                                         2. buffer_gl0_inv
11813
11814                                                           - If CU wavefront execution
11815                                                             mode, omit.
11816                                                           - Ensures that
11817                                                             following
11818                                                             loads will not see
11819                                                             stale data.
11820
11821     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
11822                               - system                     vmcnt(0) & vscnt(0)
11823
11824                                                           - If OpenCL and
11825                                                             address space is
11826                                                             not generic, omit
11827                                                             lgkmcnt(0).
11828                                                           - If OpenCL and
11829                                                             address space is
11830                                                             local, omit
11831                                                             vmcnt(0) and vscnt(0).
11832                                                           - However, since LLVM
11833                                                             currently has no
11834                                                             address space on
11835                                                             the fence, the wait
11836                                                             must conservatively
11837                                                             always be generated
11838                                                             (see comment for
11839                                                             previous fence).
11840                                                           - Could be split into
11841                                                             separate s_waitcnt
11842                                                             vmcnt(0), s_waitcnt
11843                                                             vscnt(0) and s_waitcnt
11844                                                             lgkmcnt(0) to allow
11845                                                             them to be
11846                                                             independently moved
11847                                                             according to the
11848                                                             following rules.
11849                                                           - s_waitcnt vmcnt(0)
11850                                                             must happen after
11851                                                             any preceding
11852                                                             global/generic load
11853                                                             atomic/
11854                                                             atomicrmw-with-return-value
11855                                                             with an equal or
11856                                                             wider sync scope
11857                                                             and memory ordering
11858                                                             stronger than
11859                                                             unordered (this is
11860                                                             termed the
11861                                                             fence-paired-atomic).
11862                                                           - s_waitcnt vscnt(0)
11863                                                             must happen after
11864                                                             any preceding
11865                                                             global/generic
11866                                                             atomicrmw-no-return-value
11867                                                             with an equal or
11868                                                             wider sync scope
11869                                                             and memory ordering
11870                                                             stronger than
11871                                                             unordered (this is
11872                                                             termed the
11873                                                             fence-paired-atomic).
11874                                                           - s_waitcnt lgkmcnt(0)
11875                                                             must happen after
11876                                                             any preceding
11877                                                             local/generic load
11878                                                             atomic/atomicrmw
11879                                                             with an equal or
11880                                                             wider sync scope
11881                                                             and memory ordering
11882                                                             stronger than
11883                                                             unordered (this is
11884                                                             termed the
11885                                                             fence-paired-atomic).
11886                                                           - Must happen before
11887                                                             the following
11888                                                             buffer_gl*_inv.
11889                                                           - Ensures that the
11890                                                             fence-paired atomic
11891                                                             has completed
11892                                                             before invalidating
11893                                                             the
11894                                                             caches. Therefore
11895                                                             any following
11896                                                             locations read must
11897                                                             be no older than
11898                                                             the value read by
11899                                                             the
11900                                                             fence-paired-atomic.
11901
11902                                                         2. buffer_gl0_inv;
11903                                                            buffer_gl1_inv
11904
11905                                                           - Must happen before any
11906                                                             following global/generic
11907                                                             load/load
11908                                                             atomic/store/store
11909                                                             atomic/atomicrmw.
11910                                                           - Ensures that
11911                                                             following loads
11912                                                             will not see stale
11913                                                             global data.
11914
11915     **Release Atomic**
11916     ------------------------------------------------------------------------------------
11917     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
11918                               - wavefront    - local
11919                                              - generic
11920     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
11921                                              - generic     vmcnt(0) & vscnt(0)
11922
11923                                                           - If CU wavefront execution
11924                                                             mode, omit vmcnt(0) and
11925                                                             vscnt(0).
11926                                                           - If OpenCL, omit
11927                                                             lgkmcnt(0).
11928                                                           - Could be split into
11929                                                             separate s_waitcnt
11930                                                             vmcnt(0), s_waitcnt
11931                                                             vscnt(0) and s_waitcnt
11932                                                             lgkmcnt(0) to allow
11933                                                             them to be
11934                                                             independently moved
11935                                                             according to the
11936                                                             following rules.
11937                                                           - s_waitcnt vmcnt(0)
11938                                                             must happen after
11939                                                             any preceding
11940                                                             global/generic load/load
11941                                                             atomic/
11942                                                             atomicrmw-with-return-value.
11943                                                           - s_waitcnt vscnt(0)
11944                                                             must happen after
11945                                                             any preceding
11946                                                             global/generic
11947                                                             store/store
11948                                                             atomic/
11949                                                             atomicrmw-no-return-value.
11950                                                           - s_waitcnt lgkmcnt(0)
11951                                                             must happen after
11952                                                             any preceding
11953                                                             local/generic
11954                                                             load/store/load
11955                                                             atomic/store
11956                                                             atomic/atomicrmw.
11957                                                           - Must happen before
11958                                                             the following
11959                                                             store.
11960                                                           - Ensures that all
11961                                                             memory operations
11962                                                             have
11963                                                             completed before
11964                                                             performing the
11965                                                             store that is being
11966                                                             released.
11967
11968                                                         2. buffer/global/flat_store
11969     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
11970
11971                                                           - If CU wavefront execution
11972                                                             mode, omit.
11973                                                           - If OpenCL, omit.
11974                                                           - Could be split into
11975                                                             separate s_waitcnt
11976                                                             vmcnt(0) and s_waitcnt
11977                                                             vscnt(0) to allow
11978                                                             them to be
11979                                                             independently moved
11980                                                             according to the
11981                                                             following rules.
11982                                                           - s_waitcnt vmcnt(0)
11983                                                             must happen after
11984                                                             any preceding
11985                                                             global/generic load/load
11986                                                             atomic/
11987                                                             atomicrmw-with-return-value.
11988                                                           - s_waitcnt vscnt(0)
11989                                                             must happen after
11990                                                             any preceding
11991                                                             global/generic
11992                                                             store/store atomic/
11993                                                             atomicrmw-no-return-value.
11994                                                           - Must happen before
11995                                                             the following
11996                                                             store.
11997                                                           - Ensures that all
11998                                                             global memory
11999                                                             operations have
12000                                                             completed before
12001                                                             performing the
12002                                                             store that is being
12003                                                             released.
12004
12005                                                         2. ds_store
12006     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12007                               - system       - generic     vmcnt(0) & vscnt(0)
12008
12009                                                           - If OpenCL and
12010                                                             address space is
12011                                                             not generic, omit
12012                                                             lgkmcnt(0).
12013                                                           - Could be split into
12014                                                             separate s_waitcnt
12015                                                             vmcnt(0), s_waitcnt vscnt(0)
12016                                                             and s_waitcnt
12017                                                             lgkmcnt(0) to allow
12018                                                             them to be
12019                                                             independently moved
12020                                                             according to the
12021                                                             following rules.
12022                                                           - s_waitcnt vmcnt(0)
12023                                                             must happen after
12024                                                             any preceding
12025                                                             global/generic
12026                                                             load/load
12027                                                             atomic/
12028                                                             atomicrmw-with-return-value.
12029                                                           - s_waitcnt vscnt(0)
12030                                                             must happen after
12031                                                             any preceding
12032                                                             global/generic
12033                                                             store/store atomic/
12034                                                             atomicrmw-no-return-value.
12035                                                           - s_waitcnt lgkmcnt(0)
12036                                                             must happen after
12037                                                             any preceding
12038                                                             local/generic
12039                                                             load/store/load
12040                                                             atomic/store
12041                                                             atomic/atomicrmw.
12042                                                           - Must happen before
12043                                                             the following
12044                                                             store.
12045                                                           - Ensures that all
12046                                                             memory operations
12047                                                             have
12048                                                             completed before
12049                                                             performing the
12050                                                             store that is being
12051                                                             released.
12052
12053                                                         2. buffer/global/flat_store
12054     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
12055                               - wavefront    - local
12056                                              - generic
12057     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12058                                              - generic     vmcnt(0) & vscnt(0)
12059
12060                                                           - If CU wavefront execution
12061                                                             mode, omit vmcnt(0) and
12062                                                             vscnt(0).
12063                                                           - If OpenCL, omit lgkmcnt(0).
12064                                                           - Could be split into
12065                                                             separate s_waitcnt
12066                                                             vmcnt(0), s_waitcnt
12067                                                             vscnt(0) and s_waitcnt
12068                                                             lgkmcnt(0) to allow
12069                                                             them to be
12070                                                             independently moved
12071                                                             according to the
12072                                                             following rules.
12073                                                           - s_waitcnt vmcnt(0)
12074                                                             must happen after
12075                                                             any preceding
12076                                                             global/generic load/load
12077                                                             atomic/
12078                                                             atomicrmw-with-return-value.
12079                                                           - s_waitcnt vscnt(0)
12080                                                             must happen after
12081                                                             any preceding
12082                                                             global/generic
12083                                                             store/store
12084                                                             atomic/
12085                                                             atomicrmw-no-return-value.
12086                                                           - s_waitcnt lgkmcnt(0)
12087                                                             must happen after
12088                                                             any preceding
12089                                                             local/generic
12090                                                             load/store/load
12091                                                             atomic/store
12092                                                             atomic/atomicrmw.
12093                                                           - Must happen before
12094                                                             the following
12095                                                             atomicrmw.
12096                                                           - Ensures that all
12097                                                             memory operations
12098                                                             have
12099                                                             completed before
12100                                                             performing the
12101                                                             atomicrmw that is
12102                                                             being released.
12103
12104                                                         2. buffer/global/flat_atomic
12105     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12106
12107                                                           - If CU wavefront execution
12108                                                             mode, omit.
12109                                                           - If OpenCL, omit.
12110                                                           - Could be split into
12111                                                             separate s_waitcnt
12112                                                             vmcnt(0) and s_waitcnt
12113                                                             vscnt(0) to allow
12114                                                             them to be
12115                                                             independently moved
12116                                                             according to the
12117                                                             following rules.
12118                                                           - s_waitcnt vmcnt(0)
12119                                                             must happen after
12120                                                             any preceding
12121                                                             global/generic load/load
12122                                                             atomic/
12123                                                             atomicrmw-with-return-value.
12124                                                           - s_waitcnt vscnt(0)
12125                                                             must happen after
12126                                                             any preceding
12127                                                             global/generic
12128                                                             store/store atomic/
12129                                                             atomicrmw-no-return-value.
12130                                                           - Must happen before
12131                                                             the following
12132                                                             atomicrmw.
12133                                                           - Ensures that all
12134                                                             global memory
12135                                                             operations have
12136                                                             completed before
12137                                                             performing the
12138                                                             atomicrmw that is
12139                                                             being released.
12140
12141                                                         2. ds_atomic
12142     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12143                               - system       - generic      vmcnt(0) & vscnt(0)
12144
12145                                                           - If OpenCL, omit
12146                                                             lgkmcnt(0).
12147                                                           - Could be split into
12148                                                             separate s_waitcnt
12149                                                             vmcnt(0), s_waitcnt
12150                                                             vscnt(0) and s_waitcnt
12151                                                             lgkmcnt(0) to allow
12152                                                             them to be
12153                                                             independently moved
12154                                                             according to the
12155                                                             following rules.
12156                                                           - s_waitcnt vmcnt(0)
12157                                                             must happen after
12158                                                             any preceding
12159                                                             global/generic
12160                                                             load/load atomic/
12161                                                             atomicrmw-with-return-value.
12162                                                           - s_waitcnt vscnt(0)
12163                                                             must happen after
12164                                                             any preceding
12165                                                             global/generic
12166                                                             store/store atomic/
12167                                                             atomicrmw-no-return-value.
12168                                                           - s_waitcnt lgkmcnt(0)
12169                                                             must happen after
12170                                                             any preceding
12171                                                             local/generic
12172                                                             load/store/load
12173                                                             atomic/store
12174                                                             atomic/atomicrmw.
12175                                                           - Must happen before
12176                                                             the following
12177                                                             atomicrmw.
12178                                                           - Ensures that all
12179                                                             memory operations
12180                                                             to global and local
12181                                                             have completed
12182                                                             before performing
12183                                                             the atomicrmw that
12184                                                             is being released.
12185
12186                                                         2. buffer/global/flat_atomic
12187     fence        release      - singlethread *none*     *none*
12188                               - wavefront
12189     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12190                                                            vmcnt(0) & vscnt(0)
12191
12192                                                           - If CU wavefront execution
12193                                                             mode, omit vmcnt(0) and
12194                                                             vscnt(0).
12195                                                           - If OpenCL and
12196                                                             address space is
12197                                                             not generic, omit
12198                                                             lgkmcnt(0).
12199                                                           - If OpenCL and
12200                                                             address space is
12201                                                             local, omit
12202                                                             vmcnt(0) and vscnt(0).
12203                                                           - However, since LLVM
12204                                                             currently has no
12205                                                             address space on
12206                                                             the fence, the waits
12207                                                             must conservatively
12208                                                             always be generated.
12209                                                             If the fence had an
12210                                                             address space, it
12211                                                             would be set to the
12212                                                             address space of the
12213                                                             OpenCL fence flag,
12214                                                             or to generic if both
12215                                                             local and global
12216                                                             flags are
12217                                                             specified.
12218                                                           - Could be split into
12219                                                             separate s_waitcnt
12220                                                             vmcnt(0), s_waitcnt
12221                                                             vscnt(0) and s_waitcnt
12222                                                             lgkmcnt(0) to allow
12223                                                             them to be
12224                                                             independently moved
12225                                                             according to the
12226                                                             following rules.
12227                                                           - s_waitcnt vmcnt(0)
12228                                                             must happen after
12229                                                             any preceding
12230                                                             global/generic
12231                                                             load/load
12232                                                             atomic/
12233                                                             atomicrmw-with-return-value.
12234                                                           - s_waitcnt vscnt(0)
12235                                                             must happen after
12236                                                             any preceding
12237                                                             global/generic
12238                                                             store/store atomic/
12239                                                             atomicrmw-no-return-value.
12240                                                           - s_waitcnt lgkmcnt(0)
12241                                                             must happen after
12242                                                             any preceding
12243                                                             local/generic
12244                                                             load/store/load
12245                                                             atomic/store atomic/
12246                                                             atomicrmw.
12247                                                           - Must happen before
12248                                                             any following store
12249                                                             atomic/atomicrmw
12250                                                             with an equal or
12251                                                             wider sync scope
12252                                                             and memory ordering
12253                                                             stronger than
12254                                                             unordered (this is
12255                                                             termed the
12256                                                             fence-paired-atomic).
12257                                                           - Ensures that all
12258                                                             memory operations
12259                                                             have
12260                                                             completed before
12261                                                             performing the
12262                                                             following
12263                                                             fence-paired-atomic.
12264
12265     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12266                               - system                     vmcnt(0) & vscnt(0)
12267
12268                                                           - If OpenCL and
12269                                                             address space is
12270                                                             not generic, omit
12271                                                             lgkmcnt(0).
12272                                                           - If OpenCL and
12273                                                             address space is
12274                                                             local, omit
12275                                                             vmcnt(0) and vscnt(0).
12276                                                           - However, since LLVM
12277                                                             currently has no
12278                                                             address space on
12279                                                             the fence, the waits
12280                                                             must conservatively
12281                                                             always be generated.
12282                                                             If the fence had an
12283                                                             address space, it
12284                                                             would be set to the
12285                                                             address space of the
12286                                                             OpenCL fence flag,
12287                                                             or to generic if both
12288                                                             local and global
12289                                                             flags are
12290                                                             specified.
12291                                                           - Could be split into
12292                                                             separate s_waitcnt
12293                                                             vmcnt(0), s_waitcnt
12294                                                             vscnt(0) and s_waitcnt
12295                                                             lgkmcnt(0) to allow
12296                                                             them to be
12297                                                             independently moved
12298                                                             according to the
12299                                                             following rules.
12300                                                           - s_waitcnt vmcnt(0)
12301                                                             must happen after
12302                                                             any preceding
12303                                                             global/generic
12304                                                             load/load atomic/
12305                                                             atomicrmw-with-return-value.
12306                                                           - s_waitcnt vscnt(0)
12307                                                             must happen after
12308                                                             any preceding
12309                                                             global/generic
12310                                                             store/store atomic/
12311                                                             atomicrmw-no-return-value.
12312                                                           - s_waitcnt lgkmcnt(0)
12313                                                             must happen after
12314                                                             any preceding
12315                                                             local/generic
12316                                                             load/store/load
12317                                                             atomic/store
12318                                                             atomic/atomicrmw.
12319                                                           - Must happen before
12320                                                             any following store
12321                                                             atomic/atomicrmw
12322                                                             with an equal or
12323                                                             wider sync scope
12324                                                             and memory ordering
12325                                                             stronger than
12326                                                             unordered (this is
12327                                                             termed the
12328                                                             fence-paired-atomic).
12329                                                           - Ensures that all
12330                                                             memory operations
12331                                                             have
12332                                                             completed before
12333                                                             performing the
12334                                                             following
12335                                                             fence-paired-atomic.
12336
12337     **Acquire-Release Atomic**
12338     ------------------------------------------------------------------------------------
12339     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
12340                               - wavefront    - local
12341                                              - generic
12342     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12343                                                            vmcnt(0) & vscnt(0)
12344
12345                                                           - If CU wavefront execution
12346                                                             mode, omit vmcnt(0) and
12347                                                             vscnt(0).
12348                                                           - If OpenCL, omit
12349                                                             lgkmcnt(0).
12350                                                           - Must happen after
12351                                                             any preceding
12352                                                             local/generic
12353                                                             load/store/load
12354                                                             atomic/store
12355                                                             atomic/atomicrmw.
12356                                                           - Could be split into
12357                                                             separate s_waitcnt
12358                                                             vmcnt(0), s_waitcnt
12359                                                             vscnt(0), and s_waitcnt
12360                                                             lgkmcnt(0) to allow
12361                                                             them to be
12362                                                             independently moved
12363                                                             according to the
12364                                                             following rules.
12365                                                           - s_waitcnt vmcnt(0)
12366                                                             must happen after
12367                                                             any preceding
12368                                                             global/generic load/load
12369                                                             atomic/
12370                                                             atomicrmw-with-return-value.
12371                                                           - s_waitcnt vscnt(0)
12372                                                             must happen after
12373                                                             any preceding
12374                                                             global/generic
12375                                                             store/store
12376                                                             atomic/
12377                                                             atomicrmw-no-return-value.
12378                                                           - s_waitcnt lgkmcnt(0)
12379                                                             must happen after
12380                                                             any preceding
12381                                                             local/generic
12382                                                             load/store/load
12383                                                             atomic/store
12384                                                             atomic/atomicrmw.
12385                                                           - Must happen before
12386                                                             the following
12387                                                             atomicrmw.
12388                                                           - Ensures that all
12389                                                             memory operations
12390                                                             have
12391                                                             completed before
12392                                                             performing the
12393                                                             atomicrmw that is
12394                                                             being released.
12395
12396                                                         2. buffer/global_atomic
12397                                                         3. s_waitcnt vm/vscnt(0)
12398
12399                                                           - If CU wavefront execution
12400                                                             mode, omit.
12401                                                           - Use vmcnt(0) if atomic with
12402                                                             return and vscnt(0) if
12403                                                             atomic with no-return.
12404                                                           - Must happen before
12405                                                             the following
12406                                                             buffer_gl0_inv.
12407                                                           - Ensures any
12408                                                             following global
12409                                                             data read is no
12410                                                             older than the
12411                                                             atomicrmw value
12412                                                             being acquired.
12413
12414                                                         4. buffer_gl0_inv
12415
12416                                                           - If CU wavefront execution
12417                                                             mode, omit.
12418                                                           - Ensures that
12419                                                             following
12420                                                             loads will not see
12421                                                             stale data.
12422
12423     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12424
12425                                                           - If CU wavefront execution
12426                                                             mode, omit.
12427                                                           - If OpenCL, omit.
12428                                                           - Could be split into
12429                                                             separate s_waitcnt
12430                                                             vmcnt(0) and s_waitcnt
12431                                                             vscnt(0) to allow
12432                                                             them to be
12433                                                             independently moved
12434                                                             according to the
12435                                                             following rules.
12436                                                           - s_waitcnt vmcnt(0)
12437                                                             must happen after
12438                                                             any preceding
12439                                                             global/generic load/load
12440                                                             atomic/
12441                                                             atomicrmw-with-return-value.
12442                                                           - s_waitcnt vscnt(0)
12443                                                             must happen after
12444                                                             any preceding
12445                                                             global/generic
12446                                                             store/store atomic/
12447                                                             atomicrmw-no-return-value.
12448                                                           - Must happen before
12449                                                             the following
12450                                                             atomicrmw.
12451                                                           - Ensures that all
12452                                                             global memory
12453                                                             operations have
12454                                                             completed before
12455                                                             performing the
12456                                                             atomicrmw that is
12457                                                             being released.
12458
12459                                                         2. ds_atomic
12460                                                         3. s_waitcnt lgkmcnt(0)
12461
12462                                                           - If OpenCL, omit.
12463                                                           - Must happen before
12464                                                             the following
12465                                                             buffer_gl0_inv.
12466                                                           - Ensures any
12467                                                             following global
12468                                                             data read is no
12469                                                             older than the local
12470                                                             atomicrmw value being
12471                                                             acquired.
12472
12473                                                         4. buffer_gl0_inv
12474
12475                                                           - If CU wavefront execution
12476                                                             mode, omit.
12477                                                           - If OpenCL, omit.
12478                                                           - Ensures that
12479                                                             following
12480                                                             loads will not see
12481                                                             stale data.
12482
12483     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
12484                                                            vmcnt(0) & vscnt(0)
12485
12486                                                           - If CU wavefront execution
12487                                                             mode, omit vmcnt(0) and
12488                                                             vscnt(0).
12489                                                           - If OpenCL, omit lgkmcnt(0).
12490                                                           - Could be split into
12491                                                             separate s_waitcnt
12492                                                             vmcnt(0), s_waitcnt
12493                                                             vscnt(0) and s_waitcnt
12494                                                             lgkmcnt(0) to allow
12495                                                             them to be
12496                                                             independently moved
12497                                                             according to the
12498                                                             following rules.
12499                                                           - s_waitcnt vmcnt(0)
12500                                                             must happen after
12501                                                             any preceding
12502                                                             global/generic load/load
12503                                                             atomic/
12504                                                             atomicrmw-with-return-value.
12505                                                           - s_waitcnt vscnt(0)
12506                                                             must happen after
12507                                                             any preceding
12508                                                             global/generic
12509                                                             store/store
12510                                                             atomic/
12511                                                             atomicrmw-no-return-value.
12512                                                           - s_waitcnt lgkmcnt(0)
12513                                                             must happen after
12514                                                             any preceding
12515                                                             local/generic
12516                                                             load/store/load
12517                                                             atomic/store
12518                                                             atomic/atomicrmw.
12519                                                           - Must happen before
12520                                                             the following
12521                                                             atomicrmw.
12522                                                           - Ensures that all
12523                                                             memory operations
12524                                                             have
12525                                                             completed before
12526                                                             performing the
12527                                                             atomicrmw that is
12528                                                             being released.
12529
12530                                                         2. flat_atomic
12531                                                         3. s_waitcnt lgkmcnt(0) &
12532                                                            vmcnt(0) & vscnt(0)
12533
12534                                                           - If CU wavefront execution
12535                                                             mode, omit vmcnt(0) and
12536                                                             vscnt(0).
12537                                                           - If OpenCL, omit lgkmcnt(0).
12538                                                           - Must happen before
12539                                                             the following
12540                                                             buffer_gl0_inv.
12541                                                           - Ensures any
12542                                                             following global
12543                                                             data read is no
12544                                                             older than the
12545                                                             atomicrmw value being
12546                                                             acquired.
12547
12548                                                         4. buffer_gl0_inv
12549
12550                                                           - If CU wavefront execution
12551                                                             mode, omit.
12552                                                           - Ensures that
12553                                                             following
12554                                                             loads will not see
12555                                                             stale data.
12556
12557     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12558                               - system                     vmcnt(0) & vscnt(0)
12559
12560                                                           - If OpenCL, omit
12561                                                             lgkmcnt(0).
12562                                                           - Could be split into
12563                                                             separate s_waitcnt
12564                                                             vmcnt(0), s_waitcnt
12565                                                             vscnt(0) and s_waitcnt
12566                                                             lgkmcnt(0) to allow
12567                                                             them to be
12568                                                             independently moved
12569                                                             according to the
12570                                                             following rules.
12571                                                           - s_waitcnt vmcnt(0)
12572                                                             must happen after
12573                                                             any preceding
12574                                                             global/generic
12575                                                             load/load atomic/
12576                                                             atomicrmw-with-return-value.
12577                                                           - s_waitcnt vscnt(0)
12578                                                             must happen after
12579                                                             any preceding
12580                                                             global/generic
12581                                                             store/store atomic/
12582                                                             atomicrmw-no-return-value.
12583                                                           - s_waitcnt lgkmcnt(0)
12584                                                             must happen after
12585                                                             any preceding
12586                                                             local/generic
12587                                                             load/store/load
12588                                                             atomic/store
12589                                                             atomic/atomicrmw.
12590                                                           - Must happen before
12591                                                             the following
12592                                                             atomicrmw.
12593                                                           - Ensures that all
12594                                                             memory operations
12595                                                             to global have
12596                                                             completed before
12597                                                             performing the
12598                                                             atomicrmw that is
12599                                                             being released.
12600
12601                                                         2. buffer/global_atomic
12602                                                         3. s_waitcnt vm/vscnt(0)
12603
12604                                                           - Use vmcnt(0) if atomic with
12605                                                             return and vscnt(0) if
12606                                                             atomic with no-return.
12607                                                           - Must happen before
12608                                                             following
12609                                                             buffer_gl*_inv.
12610                                                           - Ensures the
12611                                                             atomicrmw has
12612                                                             completed before
12613                                                             invalidating the
12614                                                             caches.
12615
12616                                                         4. buffer_gl0_inv;
12617                                                            buffer_gl1_inv
12618
12619                                                           - Must happen before
12620                                                             any following
12621                                                             global/generic
12622                                                             load/load
12623                                                             atomic/atomicrmw.
12624                                                           - Ensures that
12625                                                             following loads
12626                                                             will not see stale
12627                                                             global data.
12628
12629     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
12630                               - system                     vmcnt(0) & vscnt(0)
12631
12632                                                           - If OpenCL, omit
12633                                                             lgkmcnt(0).
12634                                                           - Could be split into
12635                                                             separate s_waitcnt
12636                                                             vmcnt(0), s_waitcnt
12637                                                             vscnt(0), and s_waitcnt
12638                                                             lgkmcnt(0) to allow
12639                                                             them to be
12640                                                             independently moved
12641                                                             according to the
12642                                                             following rules.
12643                                                           - s_waitcnt vmcnt(0)
12644                                                             must happen after
12645                                                             any preceding
12646                                                             global/generic
12647                                                             load/load atomic/
12648                                                             atomicrmw-with-return-value.
12649                                                           - s_waitcnt vscnt(0)
12650                                                             must happen after
12651                                                             any preceding
12652                                                             global/generic
12653                                                             store/store atomic/
12654                                                             atomicrmw-no-return-value.
12655                                                           - s_waitcnt lgkmcnt(0)
12656                                                             must happen after
12657                                                             any preceding
12658                                                             local/generic
12659                                                             load/store/load
12660                                                             atomic/store
12661                                                             atomic/atomicrmw.
12662                                                           - Must happen before
12663                                                             the following
12664                                                             atomicrmw.
12665                                                           - Ensures that all
12666                                                             memory operations
12667                                                             have
12668                                                             completed before
12669                                                             performing the
12670                                                             atomicrmw that is
12671                                                             being released.
12672
12673                                                         2. flat_atomic
12674                                                         3. s_waitcnt vm/vscnt(0) &
12675                                                            lgkmcnt(0)
12676
12677                                                           - If OpenCL, omit
12678                                                             lgkmcnt(0).
12679                                                           - Use vmcnt(0) if atomic with
12680                                                             return and vscnt(0) if
12681                                                             atomic with no-return.
12682                                                           - Must happen before
12683                                                             following
12684                                                             buffer_gl*_inv.
12685                                                           - Ensures the
12686                                                             atomicrmw has
12687                                                             completed before
12688                                                             invalidating the
12689                                                             caches.
12690
12691                                                         4. buffer_gl0_inv;
12692                                                            buffer_gl1_inv
12693
12694                                                           - Must happen before
12695                                                             any following
12696                                                             global/generic
12697                                                             load/load
12698                                                             atomic/atomicrmw.
12699                                                           - Ensures that
12700                                                             following loads
12701                                                             will not see stale
12702                                                             global data.
12703
12704     fence        acq_rel      - singlethread *none*     *none*
12705                               - wavefront
12706     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12707                                                            vmcnt(0) & vscnt(0)
12708
12709                                                           - If CU wavefront execution
12710                                                             mode, omit vmcnt(0) and
12711                                                             vscnt(0).
12712                                                           - If OpenCL and
12713                                                             address space is
12714                                                             not generic, omit
12715                                                             lgkmcnt(0).
12716                                                           - If OpenCL and
12717                                                             address space is
12718                                                             local, omit
12719                                                             vmcnt(0) and vscnt(0).
12720                                                           - However,
12721                                                             since LLVM
12722                                                             currently has no
12723                                                             address space on
12724                                                             the fence, it must
12725                                                             conservatively
12726                                                             always generate
12727                                                             (see comment for
12728                                                             previous fence).
12729                                                           - Could be split into
12730                                                             separate s_waitcnt
12731                                                             vmcnt(0), s_waitcnt
12732                                                             vscnt(0) and s_waitcnt
12733                                                             lgkmcnt(0) to allow
12734                                                             them to be
12735                                                             independently moved
12736                                                             according to the
12737                                                             following rules.
12738                                                           - s_waitcnt vmcnt(0)
12739                                                             must happen after
12740                                                             any preceding
12741                                                             global/generic
12742                                                             load/load
12743                                                             atomic/
12744                                                             atomicrmw-with-return-value.
12745                                                           - s_waitcnt vscnt(0)
12746                                                             must happen after
12747                                                             any preceding
12748                                                             global/generic
12749                                                             store/store atomic/
12750                                                             atomicrmw-no-return-value.
12751                                                           - s_waitcnt lgkmcnt(0)
12752                                                             must happen after
12753                                                             any preceding
12754                                                             local/generic
12755                                                             load/store/load
12756                                                             atomic/store atomic/
12757                                                             atomicrmw.
12758                                                           - Must happen before
12759                                                             any following
12760                                                             global/generic
12761                                                             load/load
12762                                                             atomic/store/store
12763                                                             atomic/atomicrmw.
12764                                                           - Ensures that all
12765                                                             memory operations
12766                                                             have
12767                                                             completed before
12768                                                             performing any
12769                                                             following global
12770                                                             memory operations.
12771                                                           - Ensures that the
12772                                                             preceding
12773                                                             local/generic load
12774                                                             atomic/atomicrmw
12775                                                             with an equal or
12776                                                             wider sync scope
12777                                                             and memory ordering
12778                                                             stronger than
12779                                                             unordered (this is
12780                                                             termed the
12781                                                             acquire-fence-paired-atomic)
12782                                                             has completed
12783                                                             before following
12784                                                             global memory
12785                                                             operations. This
12786                                                             satisfies the
12787                                                             requirements of
12788                                                             acquire.
12789                                                           - Ensures that all
12790                                                             previous memory
12791                                                             operations have
12792                                                             completed before a
12793                                                             following
12794                                                             local/generic store
12795                                                             atomic/atomicrmw
12796                                                             with an equal or
12797                                                             wider sync scope
12798                                                             and memory ordering
12799                                                             stronger than
12800                                                             unordered (this is
12801                                                             termed the
12802                                                             release-fence-paired-atomic).
12803                                                             This satisfies the
12804                                                             requirements of
12805                                                             release.
12806                                                           - Must happen before
12807                                                             the following
12808                                                             buffer_gl0_inv.
12809                                                           - Ensures that the
12810                                                             acquire-fence-paired-atomic
12811                                                             has completed
12812                                                             before invalidating
12813                                                             the
12814                                                             cache. Therefore
12815                                                             any following
12816                                                             locations read must
12817                                                             be no older than
12818                                                             the value read by
12819                                                             the
12820                                                             acquire-fence-paired-atomic.
12821
12822                                                         2. buffer_gl0_inv
12823
12824                                                           - If CU wavefront execution
12825                                                             mode, omit.
12826                                                           - Ensures that
12827                                                             following
12828                                                             loads will not see
12829                                                             stale data.
12830
12831     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12832                               - system                     vmcnt(0) & vscnt(0)
12833
12834                                                           - If OpenCL and
12835                                                             address space is
12836                                                             not generic, omit
12837                                                             lgkmcnt(0).
12838                                                           - If OpenCL and
12839                                                             address space is
12840                                                             local, omit
12841                                                             vmcnt(0) and vscnt(0).
12842                                                           - However, since LLVM
12843                                                             currently has no
12844                                                             address space on
12845                                                             the fence, it must
12846                                                             conservatively
12847                                                             always generate
12848                                                             (see comment for
12849                                                             previous fence).
12850                                                           - Could be split into
12851                                                             separate s_waitcnt
12852                                                             vmcnt(0), s_waitcnt
12853                                                             vscnt(0) and s_waitcnt
12854                                                             lgkmcnt(0) to allow
12855                                                             them to be
12856                                                             independently moved
12857                                                             according to the
12858                                                             following rules.
12859                                                           - s_waitcnt vmcnt(0)
12860                                                             must happen after
12861                                                             any preceding
12862                                                             global/generic
12863                                                             load/load
12864                                                             atomic/
12865                                                             atomicrmw-with-return-value.
12866                                                           - s_waitcnt vscnt(0)
12867                                                             must happen after
12868                                                             any preceding
12869                                                             global/generic
12870                                                             store/store atomic/
12871                                                             atomicrmw-no-return-value.
12872                                                           - s_waitcnt lgkmcnt(0)
12873                                                             must happen after
12874                                                             any preceding
12875                                                             local/generic
12876                                                             load/store/load
12877                                                             atomic/store
12878                                                             atomic/atomicrmw.
12879                                                           - Must happen before
12880                                                             the following
12881                                                             buffer_gl*_inv.
12882                                                           - Ensures that the
12883                                                             preceding
12884                                                             global/local/generic
12885                                                             load
12886                                                             atomic/atomicrmw
12887                                                             with an equal or
12888                                                             wider sync scope
12889                                                             and memory ordering
12890                                                             stronger than
12891                                                             unordered (this is
12892                                                             termed the
12893                                                             acquire-fence-paired-atomic)
12894                                                             has completed
12895                                                             before invalidating
12896                                                             the caches. This
12897                                                             satisfies the
12898                                                             requirements of
12899                                                             acquire.
12900                                                           - Ensures that all
12901                                                             previous memory
12902                                                             operations have
12903                                                             completed before a
12904                                                             following
12905                                                             global/local/generic
12906                                                             store
12907                                                             atomic/atomicrmw
12908                                                             with an equal or
12909                                                             wider sync scope
12910                                                             and memory ordering
12911                                                             stronger than
12912                                                             unordered (this is
12913                                                             termed the
12914                                                             release-fence-paired-atomic).
12915                                                             This satisfies the
12916                                                             requirements of
12917                                                             release.
12918
12919                                                         2. buffer_gl0_inv;
12920                                                            buffer_gl1_inv
12921
12922                                                           - Must happen before
12923                                                             any following
12924                                                             global/generic
12925                                                             load/load
12926                                                             atomic/store/store
12927                                                             atomic/atomicrmw.
12928                                                           - Ensures that
12929                                                             following loads
12930                                                             will not see stale
12931                                                             global data. This
12932                                                             satisfies the
12933                                                             requirements of
12934                                                             acquire.
12935
12936     **Sequentially Consistent Atomic**
12937     ------------------------------------------------------------------------------------
12938     load atomic  seq_cst      - singlethread - global   *Same as corresponding
12939                               - wavefront    - local    load atomic acquire,
12940                                              - generic  except must generate
12941                                                         all instructions even
12942                                                         for OpenCL.*
12943     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12944                                              - generic     vmcnt(0) & vscnt(0)
12945
12946                                                           - If CU wavefront execution
12947                                                             mode, omit vmcnt(0) and
12948                                                             vscnt(0).
12949                                                           - Could be split into
12950                                                             separate s_waitcnt
12951                                                             vmcnt(0), s_waitcnt
12952                                                             vscnt(0), and s_waitcnt
12953                                                             lgkmcnt(0) to allow
12954                                                             them to be
12955                                                             independently moved
12956                                                             according to the
12957                                                             following rules.
12958                                                           - s_waitcnt lgkmcnt(0) must
12959                                                             happen after
12960                                                             preceding
12961                                                             local/generic load
12962                                                             atomic/store
12963                                                             atomic/atomicrmw
12964                                                             with memory
12965                                                             ordering of seq_cst
12966                                                             and with equal or
12967                                                             wider sync scope.
12968                                                             (Note that seq_cst
12969                                                             fences have their
12970                                                             own s_waitcnt
12971                                                             lgkmcnt(0) and so do
12972                                                             not need to be
12973                                                             considered.)
12974                                                           - s_waitcnt vmcnt(0)
12975                                                             must happen after
12976                                                             preceding
12977                                                             global/generic load
12978                                                             atomic/
12979                                                             atomicrmw-with-return-value
12980                                                             with memory
12981                                                             ordering of seq_cst
12982                                                             and with equal or
12983                                                             wider sync scope.
12984                                                             (Note that seq_cst
12985                                                             fences have their
12986                                                             own s_waitcnt
12987                                                             vmcnt(0) and so do
12988                                                             not need to be
12989                                                             considered.)
12990                                                           - s_waitcnt vscnt(0)
                                                             must happen after
12992                                                             preceding
12993                                                             global/generic store
12994                                                             atomic/
12995                                                             atomicrmw-no-return-value
12996                                                             with memory
12997                                                             ordering of seq_cst
12998                                                             and with equal or
12999                                                             wider sync scope.
13000                                                             (Note that seq_cst
13001                                                             fences have their
13002                                                             own s_waitcnt
13003                                                             vscnt(0) and so do
13004                                                             not need to be
13005                                                             considered.)
13006                                                           - Ensures any
13007                                                             preceding
                                                             sequentially
13009                                                             consistent global/local
13010                                                             memory instructions
13011                                                             have completed
13012                                                             before executing
13013                                                             this sequentially
13014                                                             consistent
13015                                                             instruction. This
13016                                                             prevents reordering
13017                                                             a seq_cst store
13018                                                             followed by a
13019                                                             seq_cst load. (Note
13020                                                             that seq_cst is
13021                                                             stronger than
13022                                                             acquire/release as
13023                                                             the reordering of
13024                                                             load acquire
13025                                                             followed by a store
13026                                                             release is
13027                                                             prevented by the
13028                                                             s_waitcnt of
13029                                                             the release, but
13030                                                             there is nothing
13031                                                             preventing a store
13032                                                             release followed by
13033                                                             load acquire from
13034                                                             completing out of
13035                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst load. We
13039                                                             choose the load to
13040                                                             make the s_waitcnt be
13041                                                             as late as possible
13042                                                             so that the store
13043                                                             may have already
13044                                                             completed.)
13045
13046                                                         2. *Following
13047                                                            instructions same as
13048                                                            corresponding load
13049                                                            atomic acquire,
13050                                                            except must generate
13051                                                            all instructions even
13052                                                            for OpenCL.*
13053     load atomic  seq_cst      - workgroup    - local
13054
13055                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
13056
13057                                                           - If CU wavefront execution
13058                                                             mode, omit.
13059                                                           - Could be split into
13060                                                             separate s_waitcnt
13061                                                             vmcnt(0) and s_waitcnt
13062                                                             vscnt(0) to allow
13063                                                             them to be
13064                                                             independently moved
13065                                                             according to the
13066                                                             following rules.
13067                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
13069                                                             preceding
13070                                                             global/generic load
13071                                                             atomic/
13072                                                             atomicrmw-with-return-value
13073                                                             with memory
13074                                                             ordering of seq_cst
13075                                                             and with equal or
13076                                                             wider sync scope.
13077                                                             (Note that seq_cst
13078                                                             fences have their
13079                                                             own s_waitcnt
13080                                                             vmcnt(0) and so do
13081                                                             not need to be
13082                                                             considered.)
13083                                                           - s_waitcnt vscnt(0)
                                                             must happen after
13085                                                             preceding
13086                                                             global/generic store
13087                                                             atomic/
13088                                                             atomicrmw-no-return-value
13089                                                             with memory
13090                                                             ordering of seq_cst
13091                                                             and with equal or
13092                                                             wider sync scope.
13093                                                             (Note that seq_cst
13094                                                             fences have their
13095                                                             own s_waitcnt
13096                                                             vscnt(0) and so do
13097                                                             not need to be
13098                                                             considered.)
13099                                                           - Ensures any
13100                                                             preceding
                                                             sequentially
13102                                                             consistent global
13103                                                             memory instructions
13104                                                             have completed
13105                                                             before executing
13106                                                             this sequentially
13107                                                             consistent
13108                                                             instruction. This
13109                                                             prevents reordering
13110                                                             a seq_cst store
13111                                                             followed by a
13112                                                             seq_cst load. (Note
13113                                                             that seq_cst is
13114                                                             stronger than
13115                                                             acquire/release as
13116                                                             the reordering of
13117                                                             load acquire
13118                                                             followed by a store
13119                                                             release is
13120                                                             prevented by the
13121                                                             s_waitcnt of
13122                                                             the release, but
13123                                                             there is nothing
13124                                                             preventing a store
13125                                                             release followed by
13126                                                             load acquire from
13127                                                             completing out of
13128                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst load. We
13132                                                             choose the load to
13133                                                             make the s_waitcnt be
13134                                                             as late as possible
13135                                                             so that the store
13136                                                             may have already
13137                                                             completed.)
13138
13139                                                         2. *Following
13140                                                            instructions same as
13141                                                            corresponding load
13142                                                            atomic acquire,
13143                                                            except must generate
13144                                                            all instructions even
13145                                                            for OpenCL.*
13146
13147     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13148                               - system       - generic     vmcnt(0) & vscnt(0)
13149
13150                                                           - Could be split into
13151                                                             separate s_waitcnt
13152                                                             vmcnt(0), s_waitcnt
13153                                                             vscnt(0) and s_waitcnt
13154                                                             lgkmcnt(0) to allow
13155                                                             them to be
13156                                                             independently moved
13157                                                             according to the
13158                                                             following rules.
13159                                                           - s_waitcnt lgkmcnt(0)
13160                                                             must happen after
13161                                                             preceding
13162                                                             local load
13163                                                             atomic/store
13164                                                             atomic/atomicrmw
13165                                                             with memory
13166                                                             ordering of seq_cst
13167                                                             and with equal or
13168                                                             wider sync scope.
13169                                                             (Note that seq_cst
13170                                                             fences have their
13171                                                             own s_waitcnt
13172                                                             lgkmcnt(0) and so do
13173                                                             not need to be
13174                                                             considered.)
13175                                                           - s_waitcnt vmcnt(0)
13176                                                             must happen after
13177                                                             preceding
13178                                                             global/generic load
13179                                                             atomic/
13180                                                             atomicrmw-with-return-value
13181                                                             with memory
13182                                                             ordering of seq_cst
13183                                                             and with equal or
13184                                                             wider sync scope.
13185                                                             (Note that seq_cst
13186                                                             fences have their
13187                                                             own s_waitcnt
13188                                                             vmcnt(0) and so do
13189                                                             not need to be
13190                                                             considered.)
13191                                                           - s_waitcnt vscnt(0)
                                                             must happen after
13193                                                             preceding
13194                                                             global/generic store
13195                                                             atomic/
13196                                                             atomicrmw-no-return-value
13197                                                             with memory
13198                                                             ordering of seq_cst
13199                                                             and with equal or
13200                                                             wider sync scope.
13201                                                             (Note that seq_cst
13202                                                             fences have their
13203                                                             own s_waitcnt
13204                                                             vscnt(0) and so do
13205                                                             not need to be
13206                                                             considered.)
13207                                                           - Ensures any
13208                                                             preceding
                                                             sequentially
13210                                                             consistent global
13211                                                             memory instructions
13212                                                             have completed
13213                                                             before executing
13214                                                             this sequentially
13215                                                             consistent
13216                                                             instruction. This
13217                                                             prevents reordering
13218                                                             a seq_cst store
13219                                                             followed by a
13220                                                             seq_cst load. (Note
13221                                                             that seq_cst is
13222                                                             stronger than
13223                                                             acquire/release as
13224                                                             the reordering of
13225                                                             load acquire
13226                                                             followed by a store
13227                                                             release is
13228                                                             prevented by the
13229                                                             s_waitcnt of
13230                                                             the release, but
13231                                                             there is nothing
13232                                                             preventing a store
13233                                                             release followed by
13234                                                             load acquire from
13235                                                             completing out of
13236                                                             order. The s_waitcnt
                                                             could be placed after
                                                             the seq_cst store or
                                                             before the seq_cst load. We
13240                                                             choose the load to
13241                                                             make the s_waitcnt be
13242                                                             as late as possible
13243                                                             so that the store
13244                                                             may have already
13245                                                             completed.)
13246
13247                                                         2. *Following
13248                                                            instructions same as
13249                                                            corresponding load
13250                                                            atomic acquire,
13251                                                            except must generate
13252                                                            all instructions even
13253                                                            for OpenCL.*
13254     store atomic seq_cst      - singlethread - global   *Same as corresponding
13255                               - wavefront    - local    store atomic release,
13256                               - workgroup    - generic  except must generate
13257                               - agent                   all instructions even
13258                               - system                  for OpenCL.*
13259     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
13260                               - wavefront    - local    atomicrmw acq_rel,
13261                               - workgroup    - generic  except must generate
13262                               - agent                   all instructions even
13263                               - system                  for OpenCL.*
13264     fence        seq_cst      - singlethread *none*     *Same as corresponding
13265                               - wavefront               fence acq_rel,
13266                               - workgroup               except must generate
13267                               - agent                   all instructions even
13268                               - system                  for OpenCL.*
13269     ============ ============ ============== ========== ================================
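
As a rough illustration of the ``load atomic seq_cst`` rows above, the
following Python sketch models which counters the leading ``s_waitcnt`` must
wait on for a given sync scope and address space. The function name
``leading_waitcnt`` and the ``cu_mode`` flag are hypothetical and exist only
for this example; they are not part of LLVM. Combinations that the table does
not list are rejected rather than guessed.

.. code-block:: python

  # Hypothetical model of the seq_cst "load atomic" rows in the table above.
  # None of these names are LLVM APIs; they are assumptions for illustration.

  NARROW_SCOPES = {"singlethread", "wavefront"}
  WIDE_SCOPES = {"agent", "system"}
  ADDRESS_SPACES = {"global", "local", "generic"}


  def leading_waitcnt(scope, address_space, cu_mode=False):
      """Return the counters waited to zero before a seq_cst atomic load.

      cu_mode=True stands for "CU wavefront execution mode" in the table.
      """
      if address_space not in ADDRESS_SPACES:
          raise ValueError(f"unknown address space: {address_space}")
      if scope in NARROW_SCOPES:
          # The row defers entirely to load atomic acquire and lists no
          # leading wait of its own.
          return []
      if scope == "workgroup":
          if address_space == "local":
              # Row: s_waitcnt vmcnt(0) & vscnt(0), omitted in CU mode.
              return [] if cu_mode else ["vmcnt", "vscnt"]
          # Row: lgkmcnt(0) & vmcnt(0) & vscnt(0); CU mode keeps lgkmcnt only.
          return ["lgkmcnt"] if cu_mode else ["lgkmcnt", "vmcnt", "vscnt"]
      if scope in WIDE_SCOPES:
          if address_space == "local":
              raise ValueError("the table has no agent/system row for local")
          # Row: lgkmcnt(0) & vmcnt(0) & vscnt(0), with no CU mode exception.
          return ["lgkmcnt", "vmcnt", "vscnt"]
      raise ValueError(f"unknown sync scope: {scope}")

For example, ``leading_waitcnt("workgroup", "generic", cu_mode=True)`` yields
only ``lgkmcnt``, which matches the table's rule that ``vmcnt(0)`` and
``vscnt(0)`` are omitted in CU wavefront execution mode.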
13270
13271.. _amdgpu-amdhsa-trap-handler-abi:
13272
13273Trap Handler ABI
13274~~~~~~~~~~~~~~~~
13275
13276For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13277runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13278supports the ``s_trap`` instruction. For usage see:
13279
13280- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13281- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13282- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13283
13284  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13285     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13286
13287     =================== =============== =============== =======================================
13288     Usage               Code Sequence   Trap Handler    Description
13289                                         Inputs
13290     =================== =============== =============== =======================================
13291     reserved            ``s_trap 0x00``                 Reserved by hardware.
13292     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
13293                                           ``queue_ptr`` intrinsic (not implemented).
13294                                         ``VGPR0``:
13295                                           ``arg``
13296     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13297                                           ``queue_ptr`` the trap instruction. The associated
13298                                                         queue is signalled to put it into the
13299                                                         error state.  When the queue is put in
13300                                                         the error state, the waves executing
13301                                                         dispatches on the queue will be
13302                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
13307                                                         - If the debugger is enabled, causes
13308                                                           the debug trap to be reported by the
13309                                                           debugger and the wavefront is put in
13310                                                           the halt state with the PC at the
13311                                                           instruction.  The debugger must
13312                                                           increment the PC and resume the wave.
13313     reserved            ``s_trap 0x04``                 Reserved.
13314     reserved            ``s_trap 0x05``                 Reserved.
13315     reserved            ``s_trap 0x06``                 Reserved.
13316     reserved            ``s_trap 0x07``                 Reserved.
13317     reserved            ``s_trap 0x08``                 Reserved.
13318     reserved            ``s_trap 0xfe``                 Reserved.
13319     reserved            ``s_trap 0xff``                 Reserved.
13320     =================== =============== =============== =======================================
13321
13322..
13323
13324  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13325     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13326
13327     =================== =============== =============== =======================================
13328     Usage               Code Sequence   Trap Handler    Description
13329                                         Inputs
13330     =================== =============== =============== =======================================
13331     reserved            ``s_trap 0x00``                 Reserved by hardware.
13332     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
13333                                                         breakpoints. Causes wave to be halted
13334                                                         with the PC at the trap instruction.
                                                         The debugger is responsible for
                                                         resuming the wave, including the
                                                         instruction that the breakpoint
                                                         overwrote.
13338     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13339                                           ``queue_ptr`` the trap instruction. The associated
13340                                                         queue is signalled to put it into the
13341                                                         error state.  When the queue is put in
13342                                                         the error state, the waves executing
13343                                                         dispatches on the queue will be
13344                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
13349                                                         - If the debugger is enabled, causes
13350                                                           the debug trap to be reported by the
13351                                                           debugger and the wavefront is put in
13352                                                           the halt state with the PC at the
13353                                                           instruction.  The debugger must
13354                                                           increment the PC and resume the wave.
13355     reserved            ``s_trap 0x04``                 Reserved.
13356     reserved            ``s_trap 0x05``                 Reserved.
13357     reserved            ``s_trap 0x06``                 Reserved.
13358     reserved            ``s_trap 0x07``                 Reserved.
13359     reserved            ``s_trap 0x08``                 Reserved.
13360     reserved            ``s_trap 0xfe``                 Reserved.
13361     reserved            ``s_trap 0xff``                 Reserved.
13362     =================== =============== =============== =======================================
13363
13364..
13365
13366  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13367     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13368
13369     =================== =============== ================ ================= =======================================
13370     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13371     =================== =============== ================ ================= =======================================
13372     reserved            ``s_trap 0x00``                                    Reserved by hardware.
13373     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
13374                                                                            breakpoints. Causes wave to be halted
13375                                                                            with the PC at the trap instruction.
13376                                                                            The debugger is responsible to resume
13377                                                                            the wave, including the instruction
13378                                                                            that the breakpoint overwrote.
13379     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
13380                                           ``queue_ptr``                    the trap instruction. The associated
13381                                                                            queue is signalled to put it into the
13382                                                                            error state.  When the queue is put in
13383                                                                            the error state, the waves executing
13384                                                                            dispatches on the queue will be
13385                                                                            terminated.
13386     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
13387                                                                              as a no-operation. The trap handler
13388                                                                              is entered and immediately returns to
13389                                                                              continue execution of the wavefront.
13390                                                                            - If the debugger is enabled, causes
13391                                                                              the debug trap to be reported by the
13392                                                                              debugger and the wavefront is put in
13393                                                                              the halt state with the PC at the
13394                                                                              instruction.  The debugger must
13395                                                                              increment the PC and resume the wave.
13396     reserved            ``s_trap 0x04``                                    Reserved.
13397     reserved            ``s_trap 0x05``                                    Reserved.
13398     reserved            ``s_trap 0x06``                                    Reserved.
13399     reserved            ``s_trap 0x07``                                    Reserved.
13400     reserved            ``s_trap 0x08``                                    Reserved.
13401     reserved            ``s_trap 0xfe``                                    Reserved.
13402     reserved            ``s_trap 0xff``                                    Reserved.
13403     =================== =============== ================ ================= =======================================
13404
13405.. _amdgpu-amdhsa-function-call-convention:
13406
13407Call Convention
13408~~~~~~~~~~~~~~~
13409
13410.. note::
13411
  This section is currently incomplete and contains inaccuracies. It is a work
  in progress and will be updated as more information is determined.
13414
13415See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13416addresses. Unswizzled addresses are normal linear addresses.
13417
13418.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
13419
13420Kernel Functions
13421++++++++++++++++
13422
13423This section describes the call convention ABI for the outer kernel function.
13424
13425See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
13426convention.
13427
The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU backend implements function calls:
13430
134311.  Clang decides the kernarg layout to match the *HSA Programmer's Language
13432    Reference* [HSA]_.
13433
13434    - All structs are passed directly.
13435    - Lambda values are passed *TBA*.
13436
13437    .. TODO::
13438
13439      - Does this really follow HSA rules? Or are structs >16 bytes passed
13440        by-value struct?
13441      - What is ABI for lambda values?
13442
2.  The kernel performs certain setup in its prolog, as described in
    :ref:`amdgpu-amdhsa-kernel-prolog`.
13445
13446.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13447
13448Non-Kernel Functions
13449++++++++++++++++++++
13450
13451This section describes the call convention ABI for functions other than the
13452outer kernel function.
13453
If a kernel has function calls, then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.
13457
13458On entry to a function:
13459
134601.  SGPR0-3 contain a V# with the following properties (see
13461    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
13462
13463    * Base address pointing to the beginning of the wavefront scratch backing
13464      memory.
13465    * Swizzled with dword element size and stride of wavefront size elements.
13466
2.  The FLAT_SCRATCH register pair is set up. See
    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
134693.  GFX6-GFX8: M0 register set to the size of LDS in bytes. See
13470    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
134714.  The EXEC register is set to the lanes active on entry to the function.
134725.  MODE register: *TBD*
134736.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
13474    below.
7.  SGPR30-31 contain the return address (RA): the code address that the
    function must return to when it completes. The value is undefined if the
    function is *no return*.
134788.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13479    offset relative to the beginning of the wavefront scratch backing memory.
13480
13481    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
13482    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
13483    manner.
13484
13485    The unswizzled SP value can be converted into the swizzled SP value by:
13486
13487      | swizzled SP = unswizzled SP / wavefront size
13488
13489    This may be used to obtain the private address space address of stack
13490    objects and to convert this address to a flat address by adding the flat
13491    scratch aperture base address.
13492
    The swizzled SP value is always 4-byte aligned for the ``r600``
    architecture and 16-byte aligned for the ``amdgcn`` architecture.
13495
13496    .. note::
13497
13498      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13499      OpenCL language which has the largest base type defined as 16 bytes.
13500
13501    On entry, the swizzled SP value is the address of the first function
13502    argument passed on the stack. Other stack passed arguments are positive
13503    offsets from the entry swizzled SP value.
13504
13505    The function may use positive offsets beyond the last stack passed argument
13506    for stack allocated local variables and register spill slots. If necessary,
13507    the function may align these to greater alignment than 16 bytes. After these
13508    the function may dynamically allocate space for such things as runtime sized
13509    ``alloca`` local allocations.
13510
13511    If the function calls another function, it will place any stack allocated
13512    arguments after the last local allocation and adjust SGPR32 to the address
13513    after the last local allocation.
13514
135159.  All other registers are unspecified.
1351610. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
13517    to the function.
13518
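As a rough illustration of the stack pointer arithmetic described in item 8
above, the following Python sketch converts an unswizzled SP to a swizzled SP
and then to a flat address. The function names, the wavefront size of 64, and
the aperture base value are assumptions chosen for the example; they are not
part of the ABI.

.. code-block:: python

  AMDGCN_STACK_ALIGN = 16  # swizzled SP alignment for amdgcn, per the text above


  def swizzled_sp(unswizzled_sp: int, wavefront_size: int) -> int:
      """swizzled SP = unswizzled SP / wavefront size."""
      return unswizzled_sp // wavefront_size


  def flat_address(swizzled: int, flat_scratch_aperture_base: int) -> int:
      """Convert a private (swizzled) address to a flat address."""
      return flat_scratch_aperture_base + swizzled


  def is_aligned(swizzled: int, alignment: int = AMDGCN_STACK_ALIGN) -> bool:
      """Check the swizzled SP against the required stack alignment."""
      return swizzled % alignment == 0


  # Hypothetical inputs: wave64 and an aperture base picked only for illustration.
  sp = swizzled_sp(0x4000, wavefront_size=64)
  print(hex(sp), is_aligned(sp), hex(flat_address(sp, 0x1000_0000_0000)))
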
13519On exit from a function:
13520
135211.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13522    described below. Any registers used are considered clobbered registers.
135232.  The following registers are preserved and have the same value as on entry:
13524
13525    * FLAT_SCRATCH
13526    * EXEC
13527    * GFX6-GFX8: M0
13528    * All SGPR registers except the clobbered registers of SGPR4-31.
13529    * VGPR40-47
13530    * VGPR56-63
13531    * VGPR72-79
13532    * VGPR88-95
13533    * VGPR104-111
13534    * VGPR120-127
13535    * VGPR136-143
13536    * VGPR152-159
13537    * VGPR168-175
13538    * VGPR184-191
13539    * VGPR200-207
13540    * VGPR216-223
13541    * VGPR232-239
13542    * VGPR248-255
13543
13544        .. note::
13545
13546          Except the argument registers, the VGPRs clobbered and the preserved
13547          registers are intermixed at regular intervals in order to keep a
13548          similar ratio independent of the number of allocated VGPRs.
13549
13550    * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13551    * Lanes of all VGPRs that are inactive at the call site.
13552
      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
      optimization may mark some of the clobbered SGPR and VGPR registers as
      preserved if it can be determined that the called function does not change
      their value.
13557
3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
13563
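The preserved VGPR ranges listed above follow a regular pattern: starting at
VGPR40, every second block of 8 registers is preserved. The following Python
sketch encodes that pattern. The helper names are hypothetical, and the sketch
only mirrors the list in this section; it does not account for IPRA or for the
AGPR and inactive-lane rules.

.. code-block:: python

  def is_preserved_vgpr(n: int) -> bool:
      """True if VGPRn falls in one of the preserved ranges listed above."""
      return 40 <= n <= 255 and n % 16 >= 8


  def preserved_vgpr_ranges() -> list[tuple[int, int]]:
      """Return (first, last) for each preserved block of 8 VGPRs."""
      return [(first, first + 7) for first in range(40, 256, 16)]


  for first, last in preserved_vgpr_ranges():
      print(f"VGPR{first}-{last}")
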
13564.. TODO::
13565
13566  - How are function results returned? The address of structured types is passed
13567    by reference, but what about other types?
13568
13569The function input arguments are made up of the formal arguments explicitly
13570declared by the source language function plus the implicit input arguments used
13571by the implementation.
13572
13573The source language input arguments are:
13574
135751. Any source language implicit ``this`` or ``self`` argument comes first as a
13576   pointer type.
135772. Followed by the function formal arguments in left to right source order.
13578
13579The source language result arguments are:
13580
135811. The function result argument.
13582
Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if
the called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
13590
Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location,
copying the struct value into it, and passing its address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.
13596
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes its address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
13603
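The three struct passing rules above can be summarized in a small Python
sketch. The enum and function names are hypothetical and exist only for
illustration. The sketch reflects the rules as written in this section, which
the surrounding notes flag as possibly inaccurate.

.. code-block:: python

  from enum import Enum

  DIRECT_STRUCT_MAX_BYTES = 16  # size threshold stated in the text above


  class StructPassing(Enum):
      DIRECT = "direct struct"            # decomposed into base type fields
      BY_VALUE = "by-value struct"        # caller copies to stack, passes address
      SRET = "structured return (sret)"   # caller allocates result, passes address


  def classify_struct(size_in_bytes: int, is_result: bool) -> StructPassing:
      """Classify how a struct argument or result is passed, per this section."""
      if size_in_bytes <= DIRECT_STRUCT_MAX_BYTES:
          return StructPassing.DIRECT
      return StructPassing.SRET if is_result else StructPassing.BY_VALUE


  print(classify_struct(16, is_result=False).value)
  print(classify_struct(24, is_result=False).value)
  print(classify_struct(24, is_result=True).value)
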
*TODO: correct the* ``sret`` *definition.*
13605
13606.. TODO::
13607
13608  Is this definition correct? Or is ``sret`` only used if passing in registers, and
13609  pass as non-decomposed struct as stack argument? Or something else? Is the
13610  memory location in the caller stack frame, or a stack memory argument and so
13611  no address is passed as the caller can directly write to the argument stack
13612  location? But then the stack location is still live after return. If an
13613  argument stack location is it the first stack argument or the last one?
13614
13615Lambda argument types are treated as struct types with an implementation defined
13616set of fields.
13617
13618.. TODO::
13619
13620  Need to specify the ABI for lambda types for AMDGPU.
13621
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
13625
13626The AMDGPU backend walks the function call graph from the leaves to determine
13627which implicit input arguments are used, propagating to each caller of the
13628function. The used implicit arguments are appended to the function arguments
13629after the source language arguments in the following order:
13630
13631.. TODO::
13632
13633  Is recursion or external functions supported?
13634
136351.  Work-Item ID (1 VGPR)
13636
    The X, Y and Z work-item IDs are packed into a single VGPR with the
    following layout. Only fields actually used by the function are set. The
    other bits are undefined.
13640
13641    The values come from the initial kernel execution state. See
13642    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13643
13644    .. table:: Work-item implicit argument layout
13645      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
13646
13647      ======= ======= ==============
13648      Bits    Size    Field Name
13649      ======= ======= ==============
13650      9:0     10 bits X Work-Item ID
13651      19:10   10 bits Y Work-Item ID
13652      29:20   10 bits Z Work-Item ID
13653      31:30   2 bits  Unused
13654      ======= ======= ==============
13655
136562.  Dispatch Ptr (2 SGPRs)
13657
13658    The value comes from the initial kernel execution state. See
13659    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13660
136613.  Queue Ptr (2 SGPRs)
13662
13663    The value comes from the initial kernel execution state. See
13664    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13665
136664.  Kernarg Segment Ptr (2 SGPRs)
13667
13668    The value comes from the initial kernel execution state. See
13669    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13670
136715.  Dispatch id (2 SGPRs)
13672
13673    The value comes from the initial kernel execution state. See
13674    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13675
136766.  Work-Group ID X (1 SGPR)
13677
13678    The value comes from the initial kernel execution state. See
13679    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13680
136817.  Work-Group ID Y (1 SGPR)
13682
13683    The value comes from the initial kernel execution state. See
13684    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13685
136868.  Work-Group ID Z (1 SGPR)
13687
13688    The value comes from the initial kernel execution state. See
13689    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13690
136919.  Implicit Argument Ptr (2 SGPRs)
13692
13693    The value is computed by adding an offset to Kernarg Segment Ptr to get the
13694    global address space pointer to the first kernarg implicit argument.
13695
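The packed Work-Item ID layout from the table above can be illustrated with a
short Python sketch. The function names are hypothetical. The sketch treats
bits 31:30 and any unused fields as zero when packing, which is a
simplification because the ABI leaves them undefined.

.. code-block:: python

  FIELD_BITS = 10
  FIELD_MASK = (1 << FIELD_BITS) - 1  # 0x3ff, one 10-bit field


  def pack_work_item_id(x: int, y: int, z: int) -> int:
      """Pack X into bits 9:0, Y into bits 19:10 and Z into bits 29:20."""
      for value in (x, y, z):
          if not 0 <= value <= FIELD_MASK:
              raise ValueError("work-item ID does not fit in 10 bits")
      return x | (y << 10) | (z << 20)


  def unpack_work_item_id(vgpr: int) -> tuple[int, int, int]:
      """Extract (X, Y, Z) from the packed 32-bit value."""
      return (
          vgpr & FIELD_MASK,
          (vgpr >> 10) & FIELD_MASK,
          (vgpr >> 20) & FIELD_MASK,
      )


  packed = pack_work_item_id(5, 7, 9)
  print(hex(packed), unpack_work_item_id(packed))
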
13696The input and result arguments are assigned in order in the following manner:
13697
13698.. note::
13699
13700  There are likely some errors and omissions in the following description that
13701  need correction.
13702
13703  .. TODO::
13704
13705    Check the Clang source code to decipher how function arguments and return
13706    results are handled. Also see the AMDGPU specific values used.
13707
13708* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
13709  VGPR31.
13710
13711  If there are more arguments than will fit in these registers, the remaining
13712  arguments are allocated on the stack in order on naturally aligned
13713  addresses.
13714
13715  .. TODO::
13716
13717    How are overly aligned structures allocated on the stack?
13718
13719* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
13720  SGPR29.
13721
13722  If there are more arguments than will fit in these registers, the remaining
13723  arguments are allocated on the stack in order on naturally aligned
13724  addresses.
13725
13726Note that decomposed struct type arguments may have some fields passed in
13727registers and some in memory.
13728
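A minimal Python sketch of the VGPR assignment rule above is shown below. It
is not the backend's implementation and makes several assumptions that this
section does not state: each value occupies one VGPR per dword, argument sizes
are powers of two so that natural alignment equals the size, and once one
argument is placed on the stack all later arguments are too. The function names
are hypothetical.

.. code-block:: python

  NUM_ARG_VGPRS = 32  # VGPR0-VGPR31


  def align_up(offset: int, alignment: int) -> int:
      """Round offset up to the next multiple of alignment."""
      return (offset + alignment - 1) // alignment * alignment


  def assign_vgpr_arguments(arg_sizes: list[int]) -> list[tuple[str, int]]:
      """Map each argument size in bytes to ("vgpr", index) or ("stack", offset).

      Stack offsets are relative to the entry swizzled SP.
      """
      locations = []
      next_vgpr = 0
      stack_offset = 0
      on_stack = False  # assumption: no register use after the first spill
      for size in arg_sizes:
          dwords = (size + 3) // 4  # assumption: one VGPR per dword
          if not on_stack and next_vgpr + dwords <= NUM_ARG_VGPRS:
              locations.append(("vgpr", next_vgpr))
              next_vgpr += dwords
          else:
              on_stack = True
              stack_offset = align_up(stack_offset, size)  # natural alignment
              locations.append(("stack", stack_offset))
              stack_offset += size
      return locations


  # 32 dword arguments fill VGPR0-31; the next two go to the stack.
  print(assign_vgpr_arguments([4] * 32 + [4, 8])[-3:])
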
13729.. TODO::
13730
13731  So, a struct which can pass some fields as decomposed register arguments, will
13732  pass the rest as decomposed stack elements? But an argument that will not start
13733  in registers will not be decomposed and will be passed as a non-decomposed
13734  stack value?
13735
13736The following is not part of the AMDGPU function calling convention but
13737describes how the AMDGPU implements function calls:
13738
1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP, it is an
    unswizzled scratch address. It is only needed if runtime sized ``alloca``
    are used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.
13745
3.  Allocating SGPR arguments on the stack is not supported.
13747
137484.  No CFI is currently generated. See
13749    :ref:`amdgpu-dwarf-call-frame-information`.
13750
13751    .. note::
13752
13753      CFI will be generated that defines the CFA as the unswizzled address
13754      relative to the wave scratch base in the unswizzled private address space
13755      of the lowest address stack allocated local variable.
13756
13757      ``DW_AT_frame_base`` will be defined as the swizzled address in the
13758      swizzled private address space by dividing the CFA by the wavefront size
13759      (since CFA is always at least dword aligned which matches the scratch
13760      swizzle element size).
13761
13762      If no dynamic stack alignment was performed, the stack allocated arguments
13763      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
13764      local variables and register spill slots are accessed as positive offsets
13765      relative to ``DW_AT_frame_base``.
13766
137675.  Function argument passing is implemented by copying the input physical
13768    registers to virtual registers on entry. The register allocator can spill if
13769    necessary. These are copied back to physical registers at call sites. The
13770    net effect is that each function call can have these values in entirely
13771    distinct locations. The IPRA can help avoid shuffling argument registers.
137726.  Call sites are implemented by setting up the arguments at positive offsets
13773    from SP. Then SP is incremented to account for the known frame size before
13774    the call and decremented after the call.
13775
13776    .. note::
13777
13778      The CFI will reflect the changed calculation needed to compute the CFA
13779      from SP.
13780
7.  4 byte spill slots are used in the stack frame. One slot is reserved as an
    emergency spill slot. Buffer instructions are used for stack accesses
    rather than ``flat_scratch`` instructions.
13784
13785    .. TODO::
13786
13787      Explain when the emergency spill slot is used.
13788
13789.. TODO::
13790
13791  Possible broken issues:
13792
13793  - Stack arguments must be aligned to required alignment.
13794  - Stack is aligned to max(16, max formal argument alignment)
13795  - Direct argument < 64 bits should check register budget.
13796  - Register budget calculation should respect ``inreg`` for SGPR.
13797  - SGPR overflow is not handled.
13798  - struct with 1 member unpeeling is not checking size of member.
13799  - ``sret`` is after ``this`` pointer.
13800  - Caller is not implementing stack realignment: need an extra pointer.
13801  - Should say AMDGPU passes FP rather than SP.
13802  - Should CFI define CFA as address of locals or arguments. Difference is
13803    apparent when have implemented dynamic alignment.
13804  - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
13805    highest address of stack frame and use negative offset for locals. Would
13806    allow SP to be the same as FP and could support signal-handler-like as now
13807    have a real SP for the top of the stack.
13808  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
13809    arguments?
13810
13811AMDPAL
13812------
13813
13814This section provides code conventions used when the target triple OS is
13815``amdpal`` (see :ref:`amdgpu-target-triples`).
13816
13817.. _amdgpu-amdpal-code-object-metadata-section:
13818
13819Code Object Metadata
13820~~~~~~~~~~~~~~~~~~~~
13821
13822.. note::
13823
13824  The metadata is currently in development and is subject to major
13825  changes. Only the current version is supported. *When this document
13826  was generated the version was 2.6.*
13827
13828Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
13829record (see :ref:`amdgpu-note-records-v3-onwards`).
13830
13831The metadata is represented as Message Pack formatted binary data (see
13832[MsgPack]_). The top level is a Message Pack map that includes the keys
13833defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
13834and referenced tables.
13835
13836Additional information can be added to the maps. To avoid conflicts, any
13837key names should be prefixed by "*vendor-name*." where ``vendor-name``
13838can be the name of the vendor and specific vendor tool that generates the
13839information. The prefix is abbreviated to simply "." when it appears
13840within a map that has been added by the same *vendor-name*.
13841
13842  .. table:: AMDPAL Code Object Metadata Map
13843     :name: amdgpu-amdpal-code-object-metadata-map-table
13844
13845     =================== ============== ========= ======================================================================
13846     String Key          Value Type     Required? Description
13847     =================== ============== ========= ======================================================================
13848     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
13849                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
13850     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
13851                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
13852                                                  definition of the keys included in that map.
13853     =================== ============== ========= ======================================================================
13854
13855..
13856
13857  .. table:: AMDPAL Code Object Pipeline Metadata Map
13858     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
13859
13860     ====================================== ============== ========= ===================================================
13861     String Key                             Value Type     Required? Description
13862     ====================================== ============== ========= ===================================================
13863     ".name"                                string                   Source name of the pipeline.
13864     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
13865
13866                                                                       - "VsPs"
13867                                                                       - "Gs"
13868                                                                       - "Cs"
13869                                                                       - "Ngg"
13870                                                                       - "Tess"
13871                                                                       - "GsTess"
13872                                                                       - "NggTess"
13873
13874     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
13875                                            2 integers               64 bits is the "stable" portion of the hash, used
13876                                                                     for e.g. shader replacement lookup. Upper 64 bits
13877                                                                     is the "unique" portion of the hash, used for
13878                                                                     e.g. pipeline cache lookup. The value is
13879                                                                     implementation defined, and can not be relied on
13880                                                                     between different builds of the compiler.
13881     ".shaders"                             map                      Per-API shader metadata. See
13882                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
13883                                                                     for the definition of the keys included in that
13884                                                                     map.
13885     ".hardware_stages"                     map                      Per-hardware stage metadata. See
13886                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
13887                                                                     for the definition of the keys included in that
13888                                                                     map.
13889     ".shader_functions"                    map                      Per-shader function metadata. See
13890                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
13891                                                                     for the definition of the keys included in that
13892                                                                     map.
13893     ".registers"                           map            Required  Hardware register configuration. See
13894                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
13895                                                                     for the definition of the keys included in that
13896                                                                     map.
13897     ".user_data_limit"                     integer                  Number of user data entries accessed by this
13898                                                                     pipeline.
13899     ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
13900                                                                     NoUserDataSpilling.
13901     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
13902                                                                     viewport array index feature. Pipelines which use
13903                                                                     this feature can render into all 16 viewports,
13904                                                                     whereas pipelines which do not use it are
13905                                                                     restricted to viewport #0.
13906     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
13907                                                                     handling data-passing between the ES and GS
13908                                                                     shader stages. This can be zero if the data is
13909                                                                     passed using off-chip buffers. This value should
13910                                                                     be used to program all user-SGPRs which have been
13911                                                                     marked with "UserDataMapping::EsGsLdsSize"
13912                                                                     (typically only the GS and VS HW stages will ever
13913                                                                     have a user-SGPR so marked).
13914     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
13915                                                                     (maximum number of threads in a subgroup).
13916     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
13917     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
13918     ".api"                                 string                   Name of the client graphics API.
13919     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
13920                                                                     be defined by the driver using the compiler if
13921                                                                     they want to be able to correlate API-specific
13922                                                                     information used during creation at a later time.
13923     ====================================== ============== ========= ===================================================
13924
13925..
13926
13927  .. table:: AMDPAL Code Object Shader Map
13928     :name: amdgpu-amdpal-code-object-shader-map-table
13929
13930
13931     +-------------+--------------+-------------------------------------------------------------------+
13932     |String Key   |Value Type    |Description                                                        |
13933     +=============+==============+===================================================================+
13934     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
13935     |- ".vertex"  |              |for the definition of the keys included in that map.               |
13936     |- ".hull"    |              |                                                                   |
13937     |- ".domain"  |              |                                                                   |
13938     |- ".geometry"|              |                                                                   |
13939     |- ".pixel"   |              |                                                                   |
13940     +-------------+--------------+-------------------------------------------------------------------+
13941
13942..
13943
13944  .. table:: AMDPAL Code Object API Shader Metadata Map
13945     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
13946
13947     ==================== ============== ========= =====================================================================
13948     String Key           Value Type     Required? Description
13949     ==================== ============== ========= =====================================================================
13950     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
13951                          2 integers               is implementation defined, and can not be relied on between
13952                                                   different builds of the compiler.
13953     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
13954                          string                   include:
13955
13956                                                     - ".ls"
13957                                                     - ".hs"
13958                                                     - ".es"
13959                                                     - ".gs"
13960                                                     - ".vs"
13961                                                     - ".ps"
13962                                                     - ".cs"
13963
13964     ==================== ============== ========= =====================================================================
13965
13966..
13967
13968  .. table:: AMDPAL Code Object Hardware Stage Map
13969     :name: amdgpu-amdpal-code-object-hardware-stage-map-table
13970
13971     +-------------+--------------+-----------------------------------------------------------------------+
13972     |String Key   |Value Type    |Description                                                            |
13973     +=============+==============+=======================================================================+
13974     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
13975     |- ".hs"      |              |for the definition of the keys included in that map.                   |
13976     |- ".es"      |              |                                                                       |
13977     |- ".gs"      |              |                                                                       |
13978     |- ".vs"      |              |                                                                       |
13979     |- ".ps"      |              |                                                                       |
13980     |- ".cs"      |              |                                                                       |
13981     +-------------+--------------+-----------------------------------------------------------------------+
13982
13983..
13984
13985  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
13986     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
13987
13988     ========================== ============== ========= ===============================================================
13989     String Key                 Value Type     Required? Description
13990     ========================== ============== ========= ===============================================================
13991     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
13992     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
13993     ".lds_size"                integer                  Local Data Share size in bytes.
13994     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
13995     ".vgpr_count"              integer                  Number of VGPRs used.
13996     ".agpr_count"              integer                  Number of AGPRs used.
13997     ".sgpr_count"              integer                  Number of SGPRs used.
13998     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
13999                                                         directive to instruct the compiler to limit the VGPR usage to
14000                                                         be less than or equal to the specified value (only set if
14001                                                         different from HW default).
14002     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
14003                                                         default).
14004     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
14005                                3 integers
14006     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
14007     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
14008     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
14009     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
14010     ".writes_depth"            boolean                  The shader writes out a depth value.
14011     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
14012                                                         memory or GDS.
14013     ".uses_prim_id"            boolean                  The shader uses PrimID.
14014     ========================== ============== ========= ===============================================================
14015
14016..
14017
14018  .. table:: AMDPAL Code Object Shader Function Map
14019     :name: amdgpu-amdpal-code-object-shader-function-map-table
14020
14021     =============== ============== ====================================================================
14022     String Key      Value Type     Description
14023     =============== ============== ====================================================================
14024     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
14025                                    entry address. The value is the function's metadata. See
14026                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14027     =============== ============== ====================================================================
14028
14029..
14030
14031  .. table:: AMDPAL Code Object Shader Function Metadata Map
14032     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14033
14034     ============================= ============== =================================================================
14035     String Key                    Value Type     Description
14036     ============================= ============== =================================================================
14037     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
14038                                   2 integers     is implementation defined, and can not be relied on between
14039                                                  different builds of the compiler.
14040     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
14041     ".lds_size"                   integer        Size in bytes of LDS memory.
14042     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
14043     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
14044     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
14045     ".shader_subtype"             string         Shader subtype/kind. Values include:
14046
14047                                                    - "Unknown"
14048
14049     ============================= ============== =================================================================
14050
14051..
14052
14053  .. table:: AMDPAL Code Object Register Map
14054     :name: amdgpu-amdpal-code-object-register-map-table
14055
14056     ========================== ============== ====================================================================
14057     32-bit Integer Key         Value Type     Description
14058     ========================== ============== ====================================================================
14059     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14060                                               a GRBM register (i.e., driver accessible GPU register number, not
14061                                               shader GPR register number). The driver is required to program each
14062                                               specified register to the corresponding specified value when
14063                                               executing this pipeline. Typically, the ``reg offsets`` are the
14064                                               ``uint16_t`` offsets to each register as defined by the hardware
14065                                               chip headers. The register is set to the provided value. However, a
14066                                               ``reg offset`` that specifies a user data register (e.g.,
14067                                               COMPUTE_USER_DATA_0) needs special treatment. See
14068                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14069                                               information.
14070     ========================== ============== ====================================================================
14071
14072.. _amdgpu-amdpal-code-object-user-data-section:
14073
14074User Data
14075+++++++++
14076
14077Each hardware stage has a set of 32-bit physical SPI *user data registers*
14078(either 16 or 32 based on graphics IP and the stage) which can be
14079written from a command buffer and then loaded into SGPRs when waves are
14080launched via a subsequent dispatch or draw operation. This is the way
14081most arguments are passed from the application/runtime to a hardware
14082shader.
14083
PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline that a client can use to pass arguments from a command
buffer to one or more shaders in that pipeline. The ELF code object must
specify a mapping from virtualized *user data entries* to physical *user
data registers*, and PAL is responsible for implementing that mapping,
including spilling overflow *user data entries* to memory if needed.
14090
14091Since the *user data registers* are GRBM-accessible SPI registers, this
14092mapping is actually embedded in the ``.registers`` metadata entry. For
14093most registers, the value in that map is a literal 32-bit value that
14094should be written to the register by the driver. However, when the
14095register is a *user data register* (any USER_DATA register e.g.,
14096SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14097the driver to write either a *user data entry* value or one of several
14098driver-internal values to the register. This encoding is described in
14099the following table:
14100
14101.. note::
14102
14103  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14104  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14105  always be programmed to the address of the GlobalTable, and *user data
14106  register* 1 must always be programmed to the address of the PerShaderTable.
14107
14108..
14109
14110  .. table:: AMDPAL User Data Mapping
14111     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14112
14113     ==========  =================  ===============================================================================
14114     Value       Name               Description
14115     ==========  =================  ===============================================================================
14116     0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14117     0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
14118                                    always point to *user data register* 0).
14119     0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
14120                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14121                                    for more detail (should always point to *user data register* 1).
14122     0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
14123                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14124                                    more detail.
14125     0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14126                                    reference the draw index in the vertex shader. Only supported by the first
14127                                    stage in a graphics pipeline.
14128     0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
14129                                    a graphics pipeline.
14130     0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
14131                                    graphics pipeline.
14132     0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14133                                    a buffer containing the grid dimensions for a Compute dispatch operation. The
14134                                    high half of the address is stored in the next sequential user-SGPR. Only
14135                                    supported by compute pipelines.
14136     0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
14137                                    space used for the ES/GS pseudo-ring-buffer for passing data between shader
14138                                    stages.
14139     0x1000000B  ViewId             View id (32-bit unsigned integer) identifies a view of graphic
14140                                    pipeline instancing.
14141     0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
14142                                    can only appear for one shader stage per pipeline.
14143     0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
14144     0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
14145                                    only appear for one shader stage per pipeline.
14146     0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
14147                                    only appear for one shader stage per pipeline (PS). These replace color targets
14148                                    and are completely separate from any UAVs used by the shader. This is optional,
14149                                    and only used by the PS when UAV exports are used to replace color-target
14150                                    exports to optimize specific shaders.
14151     0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
14152                                    some NGG pipelines to perform culling.  This value contains the address of the
14153                                    first of two consecutive registers which provide the full GPU address.
14154     0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
14155     ==========  =================  ===============================================================================
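
As a rough illustration of the encoding in the table above, the following
Python sketch classifies a value found for a *user data register* in the
``.registers`` map. The helper name ``decode_user_data_mapping`` and the
dictionary are hypothetical and are not part of PAL or LLVM; only the numeric
values and names are taken from the table.

.. code-block:: python

  # Hypothetical lookup table; values and names copied from the
  # "AMDPAL User Data Mapping" table.
  USER_DATA_SPECIAL = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }

  # PAL exposes 128 user data entries per pipeline (indices 0..127).
  MAX_USER_DATA_ENTRIES = 128


  def decode_user_data_mapping(value):
      """Classify the value stored for a user data register.

      Returns ("entry", N) for a user data entry index, ("internal", name)
      for a driver-internal value, or ("unknown", value) otherwise.
      """
      if 0 <= value < MAX_USER_DATA_ENTRIES:
          return ("entry", value)
      name = USER_DATA_SPECIAL.get(value)
      if name is not None:
          return ("internal", name)
      return ("unknown", value)

For example, ``decode_user_data_mapping(0x10000002)`` yields
``("internal", "SpillTable")``, while a value such as ``5`` refers to
*user data entry* 5.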
14156
14157.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14158
14159Per-Shader Table
14160################
14161
The per-shader table value is the low 32 bits of the GPU address of an optional
buffer in the ``.data`` section of the ELF. The high 32 bits of the address
match the high 32 bits of the shader's program counter.

The buffer can be used for anything the shader compiler needs, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.
14171
14172Each shader's table in the ``.data`` section is referenced by the symbol
14173``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``  where *xs* corresponds with the
14174hardware shader stage the data is for. E.g.,
14175``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
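
The naming and addressing rules above can be sketched in Python as follows.
Both function names are hypothetical and exist only for illustration; the
sketch assumes the stage is given as the two-letter abbreviation (for example
``cs``) and that the program counter is available as a 64-bit integer.

.. code-block:: python

  def per_shader_table_symbol(stage):
      """Return the ELF symbol that names a stage's per-shader table."""
      return "_amdgpu_{}_shdr_intrl_data".format(stage)


  def per_shader_table_address(pc, table_lo32):
      """Rebuild the 64-bit table address.

      The high 32 bits come from the shader's program counter and the low
      32 bits come from the value held in the user data register.
      """
      return (pc & 0xFFFFFFFF00000000) | (table_lo32 & 0xFFFFFFFF)

With these helpers, ``per_shader_table_symbol("cs")`` gives
``_amdgpu_cs_shdr_intrl_data``, matching the compute example above.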
14176
14177.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14178
14179Spill Table
14180###########
14181
It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory, with
one user data register pointing to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address. The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.
14192
14193Unspecified OS
14194--------------
14195
14196This section provides code conventions used when the target triple OS is
14197empty (see :ref:`amdgpu-target-triples`).
14198
14199Trap Handler ABI
14200~~~~~~~~~~~~~~~~
14201
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
14205
14206  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14207     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14208
14209     =============== =============== ===========================================
14210     Usage           Code Sequence   Description
14211     =============== =============== ===========================================
14212     llvm.trap       s_endpgm        Causes wavefront to be terminated.
14213     llvm.debugtrap  *none*          Compiler warning given that there is no
14214                                     trap handler installed.
14215     =============== =============== ===========================================
14216
14217Source Languages
14218================
14219
14220.. _amdgpu-opencl:
14221
14222OpenCL
14223------
14224
14225When the language is OpenCL the following differences occur:
14226
142271. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
142282. The AMDGPU backend appends additional arguments to the kernel's explicit
14229   arguments for the AMDHSA OS (see
14230   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
142313. Additional metadata is generated
14232   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14233
  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================
14252
14253.. _amdgpu-hcc:
14254
14255HCC
14256---
14257
14258When the language is HCC the following differences occur:
14259
142601. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14261
14262.. _amdgpu-assembler:
14263
14264Assembler
14265---------
14266
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX11.
14269
14270This section describes general syntax for instructions and operands.
14271
14272Instructions
14273~~~~~~~~~~~~
14274
14275An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14276
14277  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14278    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14279
14280:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14281:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14282
14283The order of operands and modifiers is fixed.
14284Most modifiers are optional and may be omitted.
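
To make the operand and modifier separation concrete, here is a minimal Python
sketch that splits one instruction line according to the syntax above. It is
not how the LLVM-MC assembler parses instructions; the function names are
hypothetical, and the sketch assumes the instruction has at least one operand
whenever it has modifiers.

.. code-block:: python

  def split_top_level(text, sep):
      """Split text on sep, ignoring separators nested inside [] or ()."""
      parts, depth, current = [], 0, []
      for ch in text:
          if ch in "[(":
              depth += 1
          elif ch in "])":
              depth -= 1
          if ch == sep and depth == 0:
              parts.append("".join(current).strip())
              current = []
          else:
              current.append(ch)
      parts.append("".join(current).strip())
      return [p for p in parts if p]


  def split_instruction(line):
      """Return (opcode, operands, modifiers) for one instruction line."""
      line = line.split(";", 1)[0].strip()  # drop a trailing comment
      opcode, _, rest = line.partition(" ")
      pieces = split_top_level(rest, ",")
      if not pieces:
          return opcode, [], []
      # The last comma-separated piece holds the final operand followed by
      # the space-separated modifiers.
      tail = split_top_level(pieces[-1], " ")
      operands = pieces[:-1] + tail[:1]
      return opcode, operands, tail[1:]

Tracking bracket depth keeps modifiers such as ``quad_perm:[0,2,1,1]`` and
register ranges such as ``v[1:4]`` intact. Special operand forms, such as the
``&``-joined counters of ``s_waitcnt``, are outside the scope of this sketch.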
14285
Links to detailed instruction syntax descriptions may be found in the following
table. Note that features under development are not included
in these descriptions.
14289
14290    ============= ============================================= =======================================
14291    Architecture  Core ISA                                      ISA Variants and Extensions
14292    ============= ============================================= =======================================
14293    GCN 2         :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`             \-
14294    GCN 3, GCN 4  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`             \-
14295    GCN 5         :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14296
14297                                                                :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14298
14299                                                                :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14300
14301                                                                :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14302
14303                                                                :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14304
14305                                                                :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14306
14307    CDNA 1        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14308
14309    CDNA 2        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14310
14311    CDNA 3        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14312
14313    RDNA 1        :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>`     :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14314
14315                                                                :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14316
14317                                                                :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14318
14319                                                                :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14320
14321    RDNA 2        :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>`   :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14322
14323                                                                :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14324
14325                                                                :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14326
14327                                                                :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14328
14329                                                                :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14330
14331                                                                :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14332
14333                                                                :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14334    ============= ============================================= =======================================
14335
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_ and
[AMD-GCN-GFX10-RDNA2]_.
14342
14343Operands
14344~~~~~~~~
14345
A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14347
14348Modifiers
14349~~~~~~~~~
14350
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
14353
14354Instruction Examples
14355~~~~~~~~~~~~~~~~~~~~
14356
14357DS
14358++
14359
.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.
14369
14370FLAT
14371++++
14372
.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the
ISA Manual.
14383
14384MUBUF
14385+++++
14386
.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the
ISA Manual.
14397
14398SMRD/SMEM
14399+++++++++
14400
.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.
14411
14412SOP1
14413++++
14414
.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the
ISA Manual.
14427
14428SOP2
14429++++
14430
.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the
ISA Manual.
14445
14446SOPC
14447++++
14448
.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the
ISA Manual.
14458
14459SOPP
14460++++
14461
.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the
ISA Manual.
14478
Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
14482
14483VALU
14484++++
14485
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically selects the optimal encoding based on the operands.
To force a specific encoding, add a suffix to the opcode of the instruction:

* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA
14494
14495VOP1/VOP2/VOP3/VOPC examples:
14496
.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3
14511
14512VOP_DPP examples:
14513
.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14524
14525VOP_SDWA examples:
14526
14527.. code-block:: nasm
14528
14529  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14530  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14531  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14532  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14533  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14534
For a full list of supported instructions, refer to "Vector ALU instructions".
14536
14537.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14538
14539Code Object V2 Predefined Symbols
14540~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14541
14542.. warning::
14543  Code object V2 is not the default code object version emitted by
14544  this version of LLVM.
14545
14546The AMDGPU assembler defines and updates some symbols automatically. These
14547symbols do not affect code generation.
14548
14549.option.machine_version_major
14550+++++++++++++++++++++++++++++
14551
14552Set to the GFX major generation number of the target being assembled for. For
14553example, when assembling for a "GFX9" target this will be set to the integer
14554value "9". The possible GFX major generation numbers are presented in
14555:ref:`amdgpu-processors`.
14556
14557.option.machine_version_minor
14558+++++++++++++++++++++++++++++
14559
14560Set to the GFX minor generation number of the target being assembled for. For
14561example, when assembling for a "GFX810" target this will be set to the integer
14562value "1". The possible GFX minor generation numbers are presented in
14563:ref:`amdgpu-processors`.
14564
14565.option.machine_version_stepping
14566++++++++++++++++++++++++++++++++
14567
14568Set to the GFX stepping generation number of the target being assembled for.
14569For example, when assembling for a "GFX704" target this will be set to the
14570integer value "4". The possible GFX stepping generation numbers are presented
14571in :ref:`amdgpu-processors`.
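
As a rough illustration of how the three ``.option.machine_version_*`` symbols
relate to a processor name, the following Python sketch splits a name such as
``gfx810`` into its major, minor, and stepping numbers. It assumes the name has
the form ``gfx<major><minor><stepping>``, where minor and stepping are single
hexadecimal digits. The helper ``split_gfx_name`` is hypothetical and is not
part of LLVM; :ref:`amdgpu-processors` remains the authoritative list.

.. code-block:: python

  def split_gfx_name(name):
      """Split a processor name like 'gfx810' into (major, minor, stepping).

      Assumption: the last two characters are single hexadecimal digits for
      minor and stepping, and the digits before them are the decimal major.
      """
      lowered = name.lower()
      if not lowered.startswith("gfx") or len(lowered) < 6:
          raise ValueError(f"unexpected processor name: {name!r}")
      digits = lowered[3:]
      major = int(digits[:-2])
      minor = int(digits[-2], 16)
      stepping = int(digits[-1], 16)
      return major, minor, stepping


  print(split_gfx_name("gfx810"))  # (8, 1, 0)
  print(split_gfx_name("GFX704"))  # (7, 0, 4)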
14572
14573.kernel.vgpr_count
14574++++++++++++++++++
14575
14576Set to zero each time a
14577:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14578encountered. At each instruction, if the current value of this symbol is less
14579than or equal to the maximum VGPR number explicitly referenced within that
14580instruction then the symbol value is updated to equal that VGPR number plus
14581one.
14582
14583.kernel.sgpr_count
14584++++++++++++++++++
14585
14586Set to zero each time a
14587:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14588encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
14590instruction then the symbol value is updated to equal that SGPR number plus
14591one.
14592
14593.. _amdgpu-amdhsa-assembler-directives-v2:
14594
14595Code Object V2 Directives
14596~~~~~~~~~~~~~~~~~~~~~~~~~
14597
14598.. warning::
14599  Code object V2 is not the default code object version emitted by
14600  this version of LLVM.
14601
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
14604
14605.hsa_code_object_version major, minor
14606+++++++++++++++++++++++++++++++++++++
14607
14608*major* and *minor* are integers that specify the version of the HSA code
14609object that will be generated by the assembler.
14610
14611.hsa_code_object_isa [major, minor, stepping, vendor, arch]
14612+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14613
14614
14615*major*, *minor*, and *stepping* are all integers that describe the instruction
14616set architecture (ISA) version of the assembly program.
14617
14618*vendor* and *arch* are quoted strings. *vendor* should always be equal to
14619"AMD" and *arch* should always be equal to "AMDGPU".
14620
14621By default, the assembler will derive the ISA version, *vendor*, and *arch*
14622from the value of the -mcpu option that is passed to the assembler.
14623
14624.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
14625
14626.amdgpu_hsa_kernel (name)
14627+++++++++++++++++++++++++
14628
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
14632
14633.amd_kernel_code_t
14634++++++++++++++++++
14635
14636This directive marks the beginning of a list of key / value pairs that are used
14637to specify the amd_kernel_code_t object that will be emitted by the assembler.
14638The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
14639amd_kernel_code_t values that are unspecified a default value will be used. The
14640default value for all keys is 0, with the following exceptions:
14641
14642- *amd_code_version_major* defaults to 1.
14643- *amd_kernel_code_version_minor* defaults to 2.
14644- *amd_machine_kind* defaults to 1.
14645- *amd_machine_version_major*, *machine_version_minor*, and
14646  *amd_machine_version_stepping* are derived from the value of the -mcpu option
14647  that is passed to the assembler.
14648- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
  it defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
14651  Note that wavefront size is specified as a power of two, so a value of **n**
14652  means a size of 2^ **n**.
14653- *call_convention* defaults to -1.
14654- *kernarg_segment_alignment*, *group_segment_alignment*, and
14655  *private_segment_alignment* default to 4. Note that alignments are specified
14656  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
14657- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
14658  GFX90A onwards.
14659- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
14660  GFX10 onwards.
14661- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
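
Several of the defaults listed above are stored as powers of two. The
following Python sketch shows how such a value maps to the size or alignment it
stands for. The helper name ``decode_log2`` is hypothetical and is used only
for illustration.

.. code-block:: python

  def decode_log2(encoded):
      """Return the size or alignment that a power-of-two encoded value means."""
      return 1 << encoded


  print(decode_log2(6))  # 64: wavefront size for the default wavefront_size of 6
  print(decode_log2(5))  # 32: wavefront size for a wavefront_size of 5
  print(decode_log2(4))  # 16: alignment for the default segment alignment of 4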
14662
14663The *.amd_kernel_code_t* directive must be placed immediately after the
14664function label and before any instructions.
14665
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.
14668
14669.. _amdgpu-amdhsa-assembler-example-v2:
14670
14671Code Object V2 Example Source Code
14672~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14673
14674.. warning::
14675  Code Object V2 is not the default code object version emitted by
14676  this version of LLVM.
14677
14678Here is an example of a minimal assembly source file, defining one HSA kernel:
14679
14680.. code::
14681   :number-lines:
14682
14683   .hsa_code_object_version 1,0
14684   .hsa_code_object_isa
14685
14686   .hsatext
14687   .globl  hello_world
14688   .p2align 8
14689   .amdgpu_hsa_kernel hello_world
14690
14691   hello_world:
14692
     .amd_kernel_code_t
        enable_sgpr_kernarg_segment_ptr = 1
        is_ptr64 = 1
        compute_pgm_rsrc1_vgprs = 0
        compute_pgm_rsrc1_sgprs = 0
        compute_pgm_rsrc2_user_sgpr = 2
        compute_pgm_rsrc1_wgp_mode = 0
        compute_pgm_rsrc1_mem_ordered = 0
        compute_pgm_rsrc1_fwd_progress = 1
     .end_amd_kernel_code_t
14703
     s_load_dwordx2 s[0:1], s[0:1], 0x0
14705     v_mov_b32 v0, 3.14159
14706     s_waitcnt lgkmcnt(0)
14707     v_mov_b32 v1, s0
14708     v_mov_b32 v2, s1
14709     flat_store_dword v[1:2], v0
14710     s_endpgm
14711   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world
14713
14714.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
14715
14716Code Object V3 and Above Predefined Symbols
14717~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14718
14719The AMDGPU assembler defines and updates some symbols automatically. These
14720symbols do not affect code generation.
14721
14722.amdgcn.gfx_generation_number
14723+++++++++++++++++++++++++++++
14724
14725Set to the GFX major generation number of the target being assembled for. For
14726example, when assembling for a "GFX9" target this will be set to the integer
14727value "9". The possible GFX major generation numbers are presented in
14728:ref:`amdgpu-processors`.
14729
14730.amdgcn.gfx_generation_minor
14731++++++++++++++++++++++++++++
14732
14733Set to the GFX minor generation number of the target being assembled for. For
14734example, when assembling for a "GFX810" target this will be set to the integer
14735value "1". The possible GFX minor generation numbers are presented in
14736:ref:`amdgpu-processors`.
14737
14738.amdgcn.gfx_generation_stepping
14739+++++++++++++++++++++++++++++++
14740
14741Set to the GFX stepping generation number of the target being assembled for.
14742For example, when assembling for a "GFX704" target this will be set to the
14743integer value "4". The possible GFX stepping generation numbers are presented
14744in :ref:`amdgpu-processors`.
14745
14746.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
14747
14748.amdgcn.next_free_vgpr
14749++++++++++++++++++++++
14750
14751Set to zero before assembly begins. At each instruction, if the current value
14752of this symbol is less than or equal to the maximum VGPR number explicitly
14753referenced within that instruction then the symbol value is updated to equal
14754that VGPR number plus one.
14755
May be used to set the ``.amdhsa_next_free_vgpr`` directive in
14757:ref:`amdhsa-kernel-directives-table`.
14758
14759May be set at any time, e.g. manually set to zero at the start of each kernel.
14760
14761.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
14762
14763.amdgcn.next_free_sgpr
14764++++++++++++++++++++++
14765
14766Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.
14773
14774May be set at any time, e.g. manually set to zero at the start of each kernel.
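
The update rule shared by ``.amdgcn.next_free_vgpr`` and
``.amdgcn.next_free_sgpr`` can be modeled with the following Python sketch. It
only restates the rule described above and does not reproduce the assembler's
implementation; the function name ``update_next_free`` and the register lists
are hypothetical.

.. code-block:: python

  def update_next_free(current, referenced_registers):
      """Return the symbol value after one instruction.

      `referenced_registers` holds the register numbers the instruction
      explicitly references, for example [1, 2] for v[1:2].
      """
      if not referenced_registers:
          return current
      highest = max(referenced_registers)
      # The symbol only grows: it changes when it is <= the highest number.
      if current <= highest:
          return highest + 1
      return current


  next_free_vgpr = 0  # value before assembly begins
  for regs in ([0], [1], [2], [1, 2, 0]):  # VGPRs used by four instructions
      next_free_vgpr = update_next_free(next_free_vgpr, regs)
  print(next_free_vgpr)  # 3

  next_free_vgpr = 0  # models ".set .amdgcn.next_free_vgpr, 0" between kernels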
14775
14776.. _amdgpu-amdhsa-assembler-directives-v3-onwards:
14777
14778Code Object V3 and Above Directives
14779~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14780
14781Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
14782architecture processors, and are not OS-specific. Directives which begin with
14783``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
14784``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
14785:ref:`amdgpu-processors`.
14786
14787.. _amdgpu-assembler-directive-amdgcn-target:
14788
14789.amdgcn_target <target-triple> "-" <target-id>
14790++++++++++++++++++++++++++++++++++++++++++++++
14791
14792Optional directive which declares the ``<target-triple>-<target-id>`` supported
14793by the containing assembler source file. Used by the assembler to validate
14794command-line options such as ``-triple``, ``-mcpu``, and
14795``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
14796:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
14797
14798.. note::
14799
14800  The target ID syntax used for code object V2 to V3 for this directive differs
14801  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
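
To make the ``<target-triple> "-" <target-id>`` layout concrete, the following
Python sketch splits a directive argument into its two parts. It assumes the
triple always has the four components ``<Architecture>-<Vendor>-<OS>-<Environment>``
(the environment may be empty) and does not interpret the target ID itself. The
helper ``split_amdgcn_target`` is hypothetical.

.. code-block:: python

  def split_amdgcn_target(argument):
      """Split '<arch>-<vendor>-<os>-<environment>-<target-id>' into two parts."""
      fields = argument.split("-", 4)  # the target ID keeps any later hyphens
      if len(fields) != 5:
          raise ValueError(f"expected a triple and a target ID: {argument!r}")
      triple = "-".join(fields[:4])
      target_id = fields[4]
      return triple, target_id


  print(split_amdgcn_target("amdgcn-amd-amdhsa--gfx900+xnack"))
  # ('amdgcn-amd-amdhsa-', 'gfx900+xnack')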
14802
14803.amdhsa_kernel <name>
14804+++++++++++++++++++++
14805
14806Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
14807``<name>.kd``, in the current location of the current section. Only valid when
14808the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
14809instruction to execute, and does not need to be previously defined.
14810
14811Marks the beginning of a list of directives used to generate the bytes of a
14812kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
14813Directives which may appear in this list are described in
14814:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
14815be valid for the target being assembled for, and cannot be repeated. Directives
14816support the range of values specified by the field they reference in
14817:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
14818assumed to have its default value, unless it is marked as "Required", in which
14819case it is an error to omit the directive. This list of directives is
14820terminated by an ``.end_amdhsa_kernel`` directive.
14821
14822  .. table:: AMDHSA Kernel Assembler Directives
14823     :name: amdhsa-kernel-directives-table
14824
14825     ======================================================== =================== ============ ===================
14826     Directive                                                Default             Supported On Description
14827     ======================================================== =================== ============ ===================
14828     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX11   Controls GROUP_SEGMENT_FIXED_SIZE in
14829                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14830     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX11   Controls PRIVATE_SEGMENT_FIXED_SIZE in
14831                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14832     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX11   Controls KERNARG_SIZE in
14833                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14834     ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX11   Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
14835                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`
14836     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
14837                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14838                                                                                  GFX940)
14839     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_PTR in
14840                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14841     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX11   Controls ENABLE_SGPR_QUEUE_PTR in
14842                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14843     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX11   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
14844                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14845     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_ID in
14846                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14847     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
14848                                                                                  (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14849                                                                                  GFX940)
14850     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX11   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
14851                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14852     ``.amdhsa_wavefront_size32``                             Target              GFX10-GFX11  Controls ENABLE_WAVEFRONT_SIZE32 in
14853                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14854                                                              Specific
14855                                                              (wavefrontsize64)
14856     ``.amdhsa_uses_dynamic_stack``                           0                   GFX6-GFX11   Controls USES_DYNAMIC_STACK in
14857                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14858     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
14859                                                                                  (except      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14860                                                                                  GFX940)
14861     ``.amdhsa_enable_private_segment``                       0                   GFX940,      Controls ENABLE_PRIVATE_SEGMENT in
14862                                                                                  GFX11        :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14863     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_X in
14864                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14865     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
14866                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14867     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
14868                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14869     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_INFO in
14870                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14871     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX11   Controls ENABLE_VGPR_WORKITEM_ID in
14872                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14873                                                                                               Possible values are defined in
14874                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
14875     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX11   Maximum VGPR number explicitly referenced, plus one.
14876                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
14877                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14878     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX11   Maximum SGPR number explicitly referenced, plus one.
14879                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14880                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14881     ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of a first AccVGPR in the unified register file.
14882                                                                                  GFX940       Used to calculate ACCUM_OFFSET in
14883                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14884     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX11   Whether the kernel may use the special VCC SGPR.
14885                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14886                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14887     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
14888                                                                                  (except      scratch memory. Used to calculate
14889                                                                                  GFX940)      GRANULATED_WAVEFRONT_SGPR_COUNT in
14890                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14891     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
14892                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14893                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14894                                                              (xnack)
14895     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_32 in
14896                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14897                                                                                               Possible values are defined in
14898                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14899     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_16_64 in
14900                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14901                                                                                               Possible values are defined in
14902                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14903     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_32 in
14904                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14905                                                                                               Possible values are defined in
14906                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14907     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_16_64 in
14908                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14909                                                                                               Possible values are defined in
14910                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14911     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX11   Controls ENABLE_DX10_CLAMP in
14912                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14913     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX11   Controls ENABLE_IEEE_MODE in
14914                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14915     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX11   Controls FP16_OVFL in
14916                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14917     ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in
14918                                                              Feature             GFX940,      :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14919                                                              Specific            GFX11
14920                                                              (tgsplit)
14921     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10-GFX11  Controls ENABLE_WGP_MODE in
14922                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14923                                                              Specific
14924                                                              (cumode)
14925     ``.amdhsa_memory_ordered``                               1                   GFX10-GFX11  Controls MEM_ORDERED in
14926                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14927     ``.amdhsa_forward_progress``                             0                   GFX10-GFX11  Controls FWD_PROGRESS in
14928                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14929     ``.amdhsa_shared_vgpr_count``                            0                   GFX10-GFX11  Controls SHARED_VGPR_COUNT in
14930                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
14931     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
14932                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14933     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
14934                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14935     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
14936                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14937     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
14938                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14939     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
14940                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14941     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
14942                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14943     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
14944                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14945     ======================================================== =================== ============ ===================
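
The rules for the directive list (any order, no repeats, defaults for omitted
directives, and an error for omitted "Required" directives) can be sketched in
Python as follows. The ``DEFAULTS`` dictionary copies only a few rows from the
table above, and ``resolve_directives`` is a hypothetical helper, not the
assembler's implementation.

.. code-block:: python

  # Defaults copied from a few rows of the table; None marks "Required".
  DEFAULTS = {
      ".amdhsa_group_segment_fixed_size": 0,
      ".amdhsa_system_sgpr_workgroup_id_x": 1,
      ".amdhsa_float_denorm_mode_16_64": 3,
      ".amdhsa_next_free_vgpr": None,
      ".amdhsa_next_free_sgpr": None,
  }


  def resolve_directives(pairs):
      """Apply the ordering, repeat, and default rules to (name, value) pairs."""
      values = {}
      for name, value in pairs:
          if name not in DEFAULTS:
              raise ValueError(f"directive not modeled in this sketch: {name}")
          if name in values:
              raise ValueError(f"directive repeated: {name}")
          values[name] = value
      for name, default in DEFAULTS.items():
          if name not in values:
              if default is None:
                  raise ValueError(f"required directive missing: {name}")
              values[name] = default
      return values


  resolved = resolve_directives([
      (".amdhsa_next_free_sgpr", 2),
      (".amdhsa_next_free_vgpr", 3),
  ])
  print(resolved[".amdhsa_system_sgpr_workgroup_id_x"])  # 1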
14946
14947.amdgpu_metadata
14948++++++++++++++++
14949
14950Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
14951note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
14952
14953The contents must be in the [YAML]_ markup format, with the same structure and
14954semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
14955:ref:`amdgpu-amdhsa-code-object-metadata-v4` or
14956:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
14957
14958This directive is terminated by an ``.end_amdgpu_metadata`` directive.
14959
14960.. _amdgpu-amdhsa-assembler-example-v3-onwards:
14961
14962Code Object V3 and Above Example Source Code
14963~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14964
14965Here is an example of a minimal assembly source file, defining one HSA kernel:
14966
14967.. code::
14968   :number-lines:
14969
14970   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
14971
14972   .text
14973   .globl hello_world
14974   .p2align 8
14975   .type hello_world,@function
14976   hello_world:
     s_load_dwordx2 s[0:1], s[0:1], 0x0
14978     v_mov_b32 v0, 3.14159
14979     s_waitcnt lgkmcnt(0)
14980     v_mov_b32 v1, s0
14981     v_mov_b32 v2, s1
14982     flat_store_dword v[1:2], v0
14983     s_endpgm
14984   .Lfunc_end0:
14985     .size   hello_world, .Lfunc_end0-hello_world
14986
14987   .rodata
14988   .p2align 6
14989   .amdhsa_kernel hello_world
14990     .amdhsa_user_sgpr_kernarg_segment_ptr 1
14991     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
14992     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
14993   .end_amdhsa_kernel
14994
14995   .amdgpu_metadata
14996   ---
14997   amdhsa.version:
14998     - 1
14999     - 0
15000   amdhsa.kernels:
15001     - .name: hello_world
15002       .symbol: hello_world.kd
15003       .kernarg_segment_size: 48
15004       .group_segment_fixed_size: 0
15005       .private_segment_fixed_size: 0
15006       .kernarg_segment_align: 4
15007       .wavefront_size: 64
15008       .sgpr_count: 2
15009       .vgpr_count: 3
15010       .max_flat_workgroup_size: 256
15011       .args:
15012         - .size: 8
15013           .offset: 0
15014           .value_kind: global_buffer
15015           .address_space: global
15016           .actual_access: write_only
15017   //...
15018   .end_amdgpu_metadata
15019
15020This kernel is equivalent to the following HIP program:
15021
15022.. code::
15023   :number-lines:
15024
15025   __global__ void hello_world(float *p) {
15026       *p = 3.14159f;
15027   }
15028
If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:
15036
15037.. code::
15038   :number-lines:
15039
15040   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15041
15042   // gpr tracking symbols are implicitly set to zero
15043
15044   .text
15045   .globl kern0
15046   .p2align 8
15047   .type kern0,@function
15048   kern0:
15049     // ...
15050     s_endpgm
15051   .Lkern0_end:
15052     .size   kern0, .Lkern0_end-kern0
15053
15054   .rodata
15055   .p2align 6
15056   .amdhsa_kernel kern0
15057     // ...
15058     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15059     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15060   .end_amdhsa_kernel
15061
15062   // reset symbols to begin tracking usage in func1 and kern1
15063   .set .amdgcn.next_free_vgpr, 0
15064   .set .amdgcn.next_free_sgpr, 0
15065
15066   .text
15067   .hidden func1
15068   .global func1
15069   .p2align 2
15070   .type func1,@function
15071   func1:
15072     // ...
15073     s_setpc_b64 s[30:31]
15074   .Lfunc1_end:
     .size   func1, .Lfunc1_end-func1
15076
15077   .globl kern1
15078   .p2align 8
15079   .type kern1,@function
15080   kern1:
15081     // ...
15082     s_getpc_b64 s[4:5]
15083     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+12
15085     s_swappc_b64 s[30:31], s[4:5]
15086     // ...
15087     s_endpgm
15088   .Lkern1_end:
15089     .size   kern1, .Lkern1_end-kern1
15090
15091   .rodata
15092   .p2align 6
15093   .amdhsa_kernel kern1
15094     // ...
15095     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15096     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15097   .end_amdhsa_kernel
15098
These symbols cannot identify connected components of the call graph, so they
cannot automatically track register usage for each kernel. In some cases,
however, careful organization of the kernels and functions in the source file,
as in the example above, allows GPR usage to be calculated accurately with
minimal additional effort.
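
The effect of resetting the symbols can be modelled outside the assembler. The
following Python sketch is a toy model, not assembler code. It assumes that a
tracking symbol behaves as a running maximum of the highest register index
referenced plus one, and that ``.set`` simply overwrites it. The ``GprTracker``
class and the register ranges attributed to ``kern0``, ``func1`` and ``kern1``
are hypothetical and chosen only for illustration.

.. code:: python

   class GprTracker:
       """Toy model of one GPR tracking symbol (an assumption, not LLVM code)."""

       def __init__(self):
           # The tracking symbols start at zero.
           self.next_free = 0

       def use(self, first, last=None):
           """Record a reference to register `first` or to the range first:last."""
           highest = first if last is None else last
           # The symbol only grows: highest index seen so far, plus one.
           self.next_free = max(self.next_free, highest + 1)

       def reset(self, value=0):
           """Model `.set <symbol>, <expression>`."""
           self.next_free = value


   vgpr = GprTracker()

   # First connected component: kern0 references v[0:15] (hypothetical).
   vgpr.use(0, 15)
   kern0_vgprs = vgpr.next_free

   # Reset before the second component, as in the listing above.
   vgpr.reset()

   # Second connected component: func1 references v[0:7], kern1 references v3.
   vgpr.use(0, 7)
   vgpr.use(3)
   kern1_vgprs = vgpr.next_free

   print(kern0_vgprs, kern1_vgprs)  # 16 8

In this model ``kern1`` is charged for the registers of ``func1`` because no
reset separates them, which is the intent of grouping a function with its only
caller. Without the reset, ``kern1`` would instead inherit the value 16 left
behind by ``kern0``.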
15103
15104Additional Documentation
15105========================
15106
15107.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15109.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15110.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15111.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15112.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15113.. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15114.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15115.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15116.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15117.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15118.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15119.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15120.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ GitHub <http://github.com/RadeonOpenCompute>`__
15122.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15123.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15124.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15125.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15126.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15127.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15128.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15129.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15130.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15131.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
15132