=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphic shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphic shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exceptions:

* ``amdhsa`` is not supported in the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).

  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX9]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* *TBA*
                                                    - xnack           - Absolute
                                                                        flat                          .. TODO::
                                                                        scratch
                                                                                                         Add product
                                                                                                         names.

     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch       .. TODO::
                                                                      - Packed
                                                                        work-item        Add product
                                                                        IDs              names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     **GCN GFX10 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     **GCN GFX10 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature
  as optional components of the target ID. If omitted, the target feature has
  the ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         ``-m[no-]tgsplit``           Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 without
                                                  specifying XNACK replay. Executing code that was
                                                  generated with XNACK replay enabled, or generated
                                                  for code object V4 without specifying XNACK replay,
                                                  on a device that does not have XNACK replay
                                                  enabled will execute correctly but may be less
                                                  performant than code generated for XNACK replay
                                                  disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

where a target feature is omitted if *Off*, and present if *On* or *Any*.

.. note::

  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                             64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.

  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures), that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
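The aperture check performed by FLAT instructions can be illustrated with a small sketch. The aperture base values below are hypothetical placeholders chosen only for the example; real values come from the ``SRC_SHARED_BASE``/``SRC_PRIVATE_BASE`` inline constant registers (GFX9-GFX10) or the HSA AQL queue (GFX7-GFX8):

```python
# Illustrative sketch of how a flat address is steered to an address space.
# PRIVATE_BASE and SHARED_BASE are made-up example values, 2^32 aligned and
# outside addressable global memory, as described in the text above.

APERTURE_SIZE = 1 << 32  # aperture size is 2^32 bytes in 64-bit address mode

PRIVATE_BASE = 0x0000_2000_0000_0000  # hypothetical private (scratch) aperture
SHARED_BASE = 0x0000_3000_0000_0000   # hypothetical local (LDS) aperture

def classify_flat_address(addr: int) -> str:
    """Return which memory a FLAT instruction would access for `addr`."""
    if PRIVATE_BASE <= addr < PRIVATE_BASE + APERTURE_SIZE:
        return "private (scratch)"
    if SHARED_BASE <= addr < SHARED_BASE + APERTURE_SIZE:
        return "group (LDS)"
    return "global"

def segment_to_flat(segment_addr: int, aperture_base: int) -> int:
    """Adding the aperture base to a segment address yields a flat address."""
    return aperture_base + segment_addr

print(classify_flat_address(segment_to_flat(0x100, SHARED_BASE)))  # group (LDS)
```

Because the bases are 2^32 aligned, the low 32 bits of a flat address within an aperture are exactly the segment address, which is what makes the conversion cheap.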
  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
  For GFX9-GFX10 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32, which makes it easier to convert from flat to segment or
  segment to flat.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
  of volatile data before each kernel dispatch execution to allow constant
  memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It is higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data store
  (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different to the memory accessed by another lane
  of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving. The mapping
  used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.
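The mapping above can be checked with a short sketch; the wavefront size and addresses are example values:

```python
def private_to_backing(private_address: int, lane: int,
                       wavefront_size: int = 64,
                       wavefront_scratch_base: int = 0) -> int:
    """Map a private (scratch) address for one lane to its backing-memory
    address using the dword (4 byte) interleaving formula above."""
    return (wavefront_scratch_base
            + (private_address // 4) * wavefront_size * 4
            + lane * 4
            + private_address % 4)

# When every lane of a wavefront accesses the same private address, the
# lanes touch adjacent dwords in backing memory:
addrs = [private_to_backing(8, lane) for lane in range(4)]
print(addrs)  # [512, 516, 520, 524]
```

The adjacency of the resulting dwords is exactly why uniform per-lane accesses need fewer cache lines.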
709 710 There are different ways that the wavefront scratch base address is 711 determined by a wavefront (see 712 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 713 714 Scratch memory can be accessed in an interleaved manner using buffer 715 instructions with the scratch buffer descriptor and per wavefront scratch 716 offset, by the scratch instructions, or by flat instructions. Multi-dword 717 access is not supported except by flat and scratch instructions in 718 GFX9-GFX10. 719 720**Constant 32-bit** 721 *TODO* 722 723**Buffer Fat Pointer** 724 The buffer fat pointer is an experimental address space that is currently 725 unsupported in the backend. It exposes a non-integral pointer that is in 726 the future intended to support the modelling of 128-bit buffer descriptors 727 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit 728 *pointer*), allowing normal LLVM load/store/atomic operations to be used to 729 model the buffer descriptors used heavily in graphics workloads targeting 730 the backend. 731 732.. _amdgpu-memory-scopes: 733 734Memory Scopes 735------------- 736 737This section provides LLVM memory synchronization scopes supported by the AMDGPU 738backend memory model when the target triple OS is ``amdhsa`` (see 739:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). 740 741The memory model supported is based on the HSA memory model [HSA]_ which is 742based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before 743relation is transitive over the synchronizes-with relation independent of scope 744and synchronizes-with allows the memory scope instances to be inclusive (see 745table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). 746 747This is different to the OpenCL [OpenCL]_ memory model which does not have scope 748inclusion and requires the memory scopes to exactly match. However, this 749is conservatively correct for OpenCL. 750 751 .. 
table:: AMDHSA LLVM Sync Scopes 752 :name: amdgpu-amdhsa-llvm-sync-scopes-table 753 754 ======================= =================================================== 755 LLVM Sync Scope Description 756 ======================= =================================================== 757 *none* The default: ``system``. 758 759 Synchronizes with, and participates in modification 760 and seq_cst total orderings with, other operations 761 (except image operations) for all address spaces 762 (except private, or generic that accesses private) 763 provided the other operation's sync scope is: 764 765 - ``system``. 766 - ``agent`` and executed by a thread on the same 767 agent. 768 - ``workgroup`` and executed by a thread in the 769 same work-group. 770 - ``wavefront`` and executed by a thread in the 771 same wavefront. 772 773 ``agent`` Synchronizes with, and participates in modification 774 and seq_cst total orderings with, other operations 775 (except image operations) for all address spaces 776 (except private, or generic that accesses private) 777 provided the other operation's sync scope is: 778 779 - ``system`` or ``agent`` and executed by a thread 780 on the same agent. 781 - ``workgroup`` and executed by a thread in the 782 same work-group. 783 - ``wavefront`` and executed by a thread in the 784 same wavefront. 785 786 ``workgroup`` Synchronizes with, and participates in modification 787 and seq_cst total orderings with, other operations 788 (except image operations) for all address spaces 789 (except private, or generic that accesses private) 790 provided the other operation's sync scope is: 791 792 - ``system``, ``agent`` or ``workgroup`` and 793 executed by a thread in the same work-group. 794 - ``wavefront`` and executed by a thread in the 795 same wavefront. 
796 797 ``wavefront`` Synchronizes with, and participates in modification 798 and seq_cst total orderings with, other operations 799 (except image operations) for all address spaces 800 (except private, or generic that accesses private) 801 provided the other operation's sync scope is: 802 803 - ``system``, ``agent``, ``workgroup`` or 804 ``wavefront`` and executed by a thread in the 805 same wavefront. 806 807 ``singlethread`` Only synchronizes with and participates in 808 modification and seq_cst total orderings with, 809 other operations (except image operations) running 810 in the same thread for all address spaces (for 811 example, in signal handlers). 812 813 ``one-as`` Same as ``system`` but only synchronizes with other 814 operations within the same address space. 815 816 ``agent-one-as`` Same as ``agent`` but only synchronizes with other 817 operations within the same address space. 818 819 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with 820 other operations within the same address space. 821 822 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with 823 other operations within the same address space. 824 825 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with 826 other operations within the same address space. 827 ======================= =================================================== 828 829LLVM IR Intrinsics 830------------------ 831 832The AMDGPU backend implements the following LLVM IR intrinsics. 833 834*This section is WIP.* 835 836.. TODO:: 837 838 List AMDGPU intrinsics. 839 840LLVM IR Attributes 841------------------ 842 843The AMDGPU backend supports the following LLVM IR attributes. 844 845 .. 
  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute
                                             [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_.
     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default
                                             for the calling convention.
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field
                                             of the mode register to be set on entry. Overrides the
                                             default for the calling convention.
     ======================================= ==========================================================
.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v4`
     ========================== ===============================

..
  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    process address space applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which the
  code object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
    runtime ABI for code object V2. Specify using the Clang option
    ``-mcode-object-version=2``.

  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
    runtime ABI for code object V3. Specify using the Clang option
    ``-mcode-object-version=3``. This is the default code object
    version if not specified.

  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
    runtime ABI for code object V4. Specify using the Clang option
    ``-mcode-object-version=4``.
  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMDGPU backend compiler as it is a relocatable code
    object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

    The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
  ``e_flags`` for code object V3 to V4 (see
  :ref:`amdgpu-elf-header-e_flags-table-v3` and
  :ref:`amdgpu-elf-header-e_flags-table-v4`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
     :name: amdgpu-elf-header-e_flags-v2-table

     ===================================== ===== =============================
     Name                                  Value Description
     ===================================== ===== =============================
     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
                                                 target feature is
                                                 enabled for all code
                                                 contained in the code object.
                                                 If the processor
                                                 does not support the
                                                 ``xnack`` target
                                                 feature then must
                                                 be 0.
                                                 See
                                                 :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
                                                 handler is enabled for all
                                                 code contained in the code
                                                 object. If the processor
                                                 does not support a trap
                                                 handler then must be 0.
                                                 See
                                                 :ref:`amdgpu-target-features`.
     ===================================== ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
     :name: amdgpu-elf-header-e_flags-table-v3

     ================================= ===== =============================
     Name                              Value Description
     ================================= ===== =============================
     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
                                             mask for
                                             ``EF_AMDGPU_MACH_xxx`` values
                                             defined in
                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
                                             target feature is
                                             enabled for all code
                                             contained in the code object.
                                             If the processor
                                             does not support the
                                             ``xnack`` target
                                             feature then must
                                             be 0.
                                             See
                                             :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
                                             target feature is
                                             enabled for all code
                                             contained in the code object.
                                             If the processor
                                             does not support the
                                             ``sramecc`` target
                                             feature then must
                                             be 0.
                                             See
                                             :ref:`amdgpu-target-features`.
     ================================= ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
     :name: amdgpu-elf-header-e_flags-table-v4

     ============================================ ===== ===================================
     Name                                         Value Description
     ============================================ ===== ===================================
     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
                                                        mask for
                                                        ``EF_AMDGPU_MACH_xxx`` values
                                                        defined in
                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
     ============================================ ===== ===================================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ==================================== ========== =============================
     Name                                 Value      Description (see
                                                     :ref:`amdgpu-processor-table`)
     ==================================== ========== =============================
     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
     *reserved*                           0x011 -    Reserved for ``r600``
                                          0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
     *reserved*                           0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
     *reserved*                           0x03d      Reserved.
     *reserved*                           0x03e      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
     ==================================== ========== =============================

Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.
``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call. Generated
  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3-v4`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 8 byte
alignment.

.. _amdgpu-note-records-v2:

Code Object V2 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V2.

The note record vendor field is "AMD".
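The note record padding rules described above can be sketched in Python. This
is an illustrative helper, not LLVM code; the function name and the example
vendor/type values are hypothetical:

.. code:: python

  import struct

  def build_elf_note(name: bytes, note_type: int, desc: bytes) -> bytes:
      """Pack one ELF note record: a 12-byte header of namesz, descsz and
      type, then the NUL-terminated name zero-padded so desc starts 4-byte
      aligned, then desc zero-padded to a multiple of 4 bytes."""
      name = name + b"\0"              # namesz counts the terminating NUL
      name_pad = (-len(name)) % 4      # pad after name to 4-byte boundary
      desc_pad = (-len(desc)) % 4      # pad desc size to a multiple of 4
      header = struct.pack("<3I", len(name), len(desc), note_type)
      return header + name + b"\0" * name_pad + desc + b"\0" * desc_pad

  # A hypothetical 5-byte desc ends up occupying 8 bytes after padding.
  note = build_elf_note(b"AMDGPU", 32, b"\x01\x02\x03\x04\x05")
  assert len(note) % 4 == 0

Note that the packed ``namesz`` and ``descsz`` fields record the unpadded
sizes; only the on-disk layout carries the zero padding.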
Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V2 ELF Note Records
     :name: amdgpu-elf-note-records-v2-table

     ===== ===================================== ======================================
     Name  Type                                  Description
     ===== ===================================== ======================================
     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
                                                 Finalizer and not the LLVM compiler.
     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
                                                 YAML [YAML]_ textual format.
     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
     ===== ===================================== ======================================

..

  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-v2-table

     ===================================== =====
     Name                                  Value
     ===================================== =====
     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
     ``NT_AMD_HSA_HSAIL``                  2
     ``NT_AMD_HSA_ISA_VERSION``            3
     *reserved*                            4-9
     ``NT_AMD_HSA_METADATA``               10
     ``NT_AMD_HSA_ISA_NAME``               11
     ===================================== =====

``NT_AMD_HSA_CODE_OBJECT_VERSION``
  Specifies the code object version number. The description field has the
  following layout:

  .. code::

    struct amdgpu_hsa_note_code_object_version_s {
      uint32_t major_version;
      uint32_t minor_version;
    };

  The ``major_version`` has a value less than or equal to 2.

``NT_AMD_HSA_HSAIL``
  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
  field has the following layout:
  .. code::

    struct amdgpu_hsa_note_hsail_s {
      uint32_t hsail_major_version;
      uint32_t hsail_minor_version;
      uint8_t profile;
      uint8_t machine_model;
      uint8_t default_float_round;
    };

``NT_AMD_HSA_ISA_VERSION``
  Specifies the target ISA version. The description field has the following
  layout:

  .. code::

    struct amdgpu_hsa_note_isa_s {
      uint16_t vendor_name_size;
      uint16_t architecture_name_size;
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
      char vendor_and_architecture_name[1];
    };

  ``vendor_name_size`` and ``architecture_name_size`` are the length of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.

  This note record is used by the HSA runtime loader.

  Code object V2 only supports a limited number of processors and has fixed
  settings for target features. See
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
  processors and the corresponding target ID. In the table the note record ISA
  name is a concatenation of the vendor name, architecture name, major, minor,
  and stepping separated by a ":".

  The target ID column shows the processor name and fixed target features used
  by the LLVM compiler. The LLVM compiler does not generate a
  ``NT_AMD_HSA_HSAIL`` note record.

  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature is as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
  bit.
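As an illustration of the ``amdgpu_hsa_note_isa_s`` layout and the ":"
concatenation just described, the description field and the note record ISA
name could be produced like this (a minimal Python sketch; the helper names
are hypothetical):

.. code:: python

  import struct

  def pack_isa_version_desc(vendor: str, arch: str,
                            major: int, minor: int, stepping: int) -> bytes:
      """Pack amdgpu_hsa_note_isa_s: two uint16 sizes that count the
      trailing NUL, three uint32 version fields, then the two
      NUL-terminated names back to back."""
      v = vendor.encode() + b"\0"
      a = arch.encode() + b"\0"
      return struct.pack("<2H3I", len(v), len(a), major, minor, stepping) + v + a

  def isa_note_name(vendor: str, arch: str,
                    major: int, minor: int, stepping: int) -> str:
      """The note record ISA name is the five fields joined by ':'."""
      return ":".join([vendor, arch, str(major), str(minor), str(stepping)])

  assert isa_note_name("AMD", "AMDGPU", 8, 0, 3) == "AMD:AMDGPU:8:0:3"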
``NT_AMD_HSA_ISA_NAME``
  Specifies the target ISA name as a non-NUL terminated string.

  This note record is not used by the HSA runtime loader.

  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
  V2's limited support of processors and fixed settings for target features.

  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
  from the string to the corresponding target ID. If the ``xnack`` target
  feature is supported and enabled, the string produced by the LLVM compiler
  may have a ``+xnack`` appended. The Finalizer did not do the appending and
  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.

``NT_AMD_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on HSA
  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
  metadata string.
  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
     :name: amdgpu-elf-note-record-supported_processors-v2-table

     ==================== ==========================
     Note Record ISA Name Target ID
     ==================== ==========================
     ``AMD:AMDGPU:6:0:0`` ``gfx600``
     ``AMD:AMDGPU:6:0:1`` ``gfx601``
     ``AMD:AMDGPU:6:0:2`` ``gfx602``
     ``AMD:AMDGPU:7:0:0`` ``gfx700``
     ``AMD:AMDGPU:7:0:1`` ``gfx701``
     ``AMD:AMDGPU:7:0:2`` ``gfx702``
     ``AMD:AMDGPU:7:0:3`` ``gfx703``
     ``AMD:AMDGPU:7:0:4`` ``gfx704``
     ``AMD:AMDGPU:7:0:5`` ``gfx705``
     ``AMD:AMDGPU:8:0:0`` ``gfx802``
     ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
     ``AMD:AMDGPU:8:0:2`` ``gfx802``
     ``AMD:AMDGPU:8:0:3`` ``gfx803``
     ``AMD:AMDGPU:8:0:4`` ``gfx803``
     ``AMD:AMDGPU:8:0:5`` ``gfx805``
     ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
     ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
     ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
     ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
     ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
     ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
     ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
     ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
     ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
     ==================== ==========================

.. _amdgpu-note-records-v3-v4:

Code Object V3 to V4 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V3 to V4.

The note record vendor field is "AMDGPU".

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.
  .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
     :name: amdgpu-elf-note-records-table-v3-v4

     ======== ============================== ======================================
     Name     Type                           Description
     ======== ============================== ======================================
     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
                                             binary format.
     ======== ============================== ======================================

..

  .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                     0-31
     ``NT_AMDGPU_METADATA``         32
     ============================== =====

``NT_AMDGPU_METADATA``
  Specifies extensible metadata associated with an AMDGPU code object. It is
  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
  :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
  :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
  ``amdhsa`` OS.

.. _amdgpu-symbols:

Symbols
-------

Symbols include the following:
  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ================== ================ ==================
     Name                  Type               Section          Description
     ===================== ================== ================ ==================
     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
                                              - ``.rodata``
                                              - ``.bss``
     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
     ===================== ================== ================ ==================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``STN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor specific section name ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.

  .. TODO::

     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of the
  kernel descriptor that is used in the AQL dispatch packet used to invoke the
  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.
.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:
  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     *none*     *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ========================== ======= ===== ========== ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.

.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:

Loaded Code Object Path Uniform Resource Identifier (URI)
---------------------------------------------------------

The AMD GPU code object loader represents the path of the ELF shared object from
which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in memory loaded relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
  and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%". Directories in
  the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000
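A consumer of these URIs could parse them along the lines of the BNF above.
This is a minimal Python sketch under simplifying assumptions (it does not
URI-decode ``file_path`` and accepts the fields only in the order shown in the
examples); the function name is hypothetical:

.. code:: python

  import re

  def parse_code_object_uri(uri: str) -> dict:
      """Split a loaded code object path URI into scheme, path/PID, offset
      and size, following the grammar sketched above."""
      m = re.match(
          r"^(file|memory)://([^#?]*)(?:[#?]offset=([^&]+)&size=([^&]+))?$",
          uri)
      if not m:
          raise ValueError(f"not a code object URI: {uri}")
      scheme, path, offset, size = m.groups()
      return {
          "scheme": scheme,
          "path_or_pid": path,
          # offset/size are C integral literals: int(x, 0) handles the
          # 0x/0X hex, leading-0 octal and decimal forms.
          "offset": int(offset, 0) if offset is not None else 0,
          "size": int(size, 0) if size is not None else None,
      }

  parsed = parse_code_object_uri("file:///dir3/dir4/file2#offset=0x2000&size=3000")
  assert parsed["offset"] == 0x2000 and parsed["size"] == 3000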
.. _amdgpu-dwarf-debug-information:

DWARF Debug Information
=======================

.. warning::

   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
   is not currently fully implemented and is subject to change.

AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
:ref:`amdgpu-elf-code-object`) which contain information that maps the code
object executable code and data to the source language constructs. It can be
used by tools such as debuggers and profilers. It uses features defined in
:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.

This section defines the AMDGPU target architecture specific DWARF mappings.

.. _amdgpu-dwarf-register-identifier:

Register Identifier
-------------------

This section defines the AMDGPU target architecture register numbers used in
DWARF operation expressions (see DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
instructions (see DWARF Version 5 section 6.4 and
:ref:`amdgpu-dwarf-call-frame-information`).

A single code object can contain code for kernels that have different wavefront
sizes. The vector registers and some scalar registers are based on the wavefront
size. AMDGPU defines distinct DWARF registers for each wavefront size. This
simplifies the consumer of the DWARF so that each register has a fixed size,
rather than being dynamic according to the wavefront size mode. Similarly,
distinct DWARF registers are defined for those registers that vary in size
according to the process address size. This allows a consumer to treat a
specific AMDGPU processor as a single architecture regardless of how it is
configured at run time.
The compiler explicitly specifies the DWARF registers
that match the mode in which the code it is generating will be executed.

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte
                                             ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================

The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32-bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
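The lane-indexed layout just described can be illustrated with a short sketch.
The helper name and the in-memory register image are illustrative assumptions,
not an LLVM or debugger API.

```python
import struct

def lane_dword(vector_register: bytes, lane: int) -> int:
    # The register image holds one dword (4 bytes) per lane, with lane 0 at
    # the least significant position, so lane N's dword starts at byte offset
    # N * 4 (the part DW_OP_LLVM_push_lane plus DW_OP_LLVM_offset would
    # select for the focused lane).
    return struct.unpack_from("<I", vector_register, lane * 4)[0]

# A toy wavefront-64 VGPR image where lane i holds the value i * 10.
wavefront_size = 64
vgpr = b"".join(struct.pack("<I", i * 10) for i in range(wavefront_size))
```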
If the wavefront size is 32 lanes, then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes, then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 mode. The register definitions corresponding
to the wavefront mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is generated
to execute in a 64-bit process address space, then the 64-bit process address
space register definitions are used. The ``amdgcn`` target only supports the
64-bit process address space.

.. _amdgpu-dwarf-address-class-identifier:

Address Class Identifier
------------------------

The DWARF address class represents the source language memory space. See DWARF
Version 5 section 2.12 which is updated by the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address class mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.
.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

The DWARF address class values defined in the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.

In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is
available for use for the AMD extension for access to the hardware GDS memory
which is scratchpad memory allocated per device.

For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
address class of ``DW_ADDR_none`` is used.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address size
and NULL value.

.. _amdgpu-dwarf-address-space-identifier:

Address Space Identifier
------------------------

DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address space mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-space-mapping-table`.
.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                         AMDGPU                             Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
   ======================================= ===== ======= ======== ================= =======================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
including address size and NULL value.

The ``DW_ASPACE_none`` address space is the default target architecture address
space used in DWARF operations that do not specify an address space. It
therefore has to map to the global address space so that the ``DW_OP_addr*`` and
related operations can refer to addresses in the program code.

The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
specify the flat address space. If the address corresponds to an address in the
local address space, then it corresponds to the wavefront that is executing the
focused thread of execution.
If the address corresponds to an address in the
private address space, then it corresponds to the lane that is executing the
focused thread of execution for languages that are implemented using a SIMD or
SIMT execution model.

.. note::

  CUDA-like languages such as HIP that do not have address spaces in the
  language type system, but do allow variables to be allocated in different
  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
  address space in the DWARF expression operations as the default address space
  is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is executing
the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
to specify the private address space corresponding to the lane that is executing
the focused thread of execution for languages that are implemented using a SIMD
or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
to specify the unswizzled private address space corresponding to the wavefront
that is executing the focused thread of execution. The wavefront view of private
memory is the per wavefront unswizzled backing memory layout defined in
:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
location for the backing memory of the wavefront (namely the address is not
offset by ``wavefront-scratch-base``).
The following formula can be used to
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront = private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, onto which a source language maps its
threads of execution. The DWARF lane identifier is pushed by
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.
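The conversion between lane and wavefront private addresses given earlier,
which uses this lane position, can be sketched as follows. The function names
are illustrative assumptions, not part of any debugger or runtime API.

```python
def private_lane_to_wave(lane_addr: int, lane_id: int, wavefront_size: int) -> int:
    # General form of the conversion formula: private memory is swizzled in
    # dword (4-byte) units, one dword per lane.
    return (((lane_addr // 4) * wavefront_size * 4)
            + (lane_id * 4)
            + (lane_addr % 4))

def private_lane_to_wave_aligned(lane_addr: int, wavefront_size: int) -> int:
    # Simplification for a dword-aligned address when the start of lane 0's
    # dword is wanted.
    assert lane_addr % 4 == 0
    return lane_addr * wavefront_size
```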
For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage, which includes
memory and registers. When accessing storage on AMDGPU, bytes are ordered with
least significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
unwinding vector registers that are spilled under the execution mask to memory:
the zero-single location description is the vector register, and the one-single
location description is the spilled memory location description. The
``DW_OP_LLVM_form_aspace_address`` operation is used to specify the address
space of the memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated
by *DWARF Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-debugging-information-entry-attributes`.
.. _amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
location of the separate lanes of a SIMT thread.

If the lane is an active lane, then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and performing a
logical ``AND`` with the ``EXEC`` mask saved at the start of the region. After
the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the value it had at
the beginning of the region. This is shown below. Other approaches are possible,
but the basic concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
    b;
    %3 = EXEC
    %4 = c2
  $lex_1_1_start:
    EXEC = %3 & %4
  $lex_1_1_then:
    c;
    EXEC = ~EXEC & %3
  $lex_1_1_else:
    d;
    EXEC = %3
  $lex_1_1_end:
    e;
    EXEC = ~EXEC & %1
  $lex_1_else:
    f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This can
be done by defining an artificial variable for the lane PC. The DWARF location
list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.

A DWARF procedure is defined for each well-nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
conceptually inherits the value of the immediately enclosing region and modifies
it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
the region for the ``THEN`` region since it is executed first.
For the ``ELSE``
region the divergent program location is at the end of the ``IF/THEN/ELSE``
region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the result
by inserting the current program location for each lane that the ``EXEC`` mask
indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the DWARF
operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    b;
    %3 = EXEC;
    DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
    %4 = c2;
  $lex_1_1_start:
    EXEC = %3 & %4;
  $lex_1_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    c;
    EXEC = ~EXEC & %3;
  $lex_1_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    d;
    EXEC = %3;
  $lex_1_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC
elements that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
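The ``EXEC`` mask semantics of the linearized ``IF/THEN/ELSE`` example can be
checked with a small simulation. This is a toy model with a hypothetical 8-lane
wavefront, not backend code; the statement names match the earlier pseudo code.

```python
WAVE = 8  # toy wavefront size (real hardware uses 32 or 64 lanes)

def run(c1: int, c2: int) -> dict:
    """Simulate the linearized IF(c1){ b; IF(c2){c}ELSE{d} e }ELSE{ f } g,
    recording the statements each lane executes. c1/c2 are lane bitmasks."""
    trace = {lane: ["a"] for lane in range(WAVE)}
    exec_ = (1 << WAVE) - 1           # all lanes active on entry

    def step(mask, stmt):
        for lane in range(WAVE):
            if mask >> lane & 1:
                trace[lane].append(stmt)

    m1 = exec_                        # %1 = EXEC (saved outer mask)
    exec_ = m1 & c1                   # enter outer THEN
    step(exec_, "b")
    m3 = exec_                        # %3 = EXEC (saved inner mask)
    exec_ = m3 & c2                   # enter inner THEN
    step(exec_, "c")
    exec_ = ~exec_ & m3               # inner ELSE: negate, AND saved mask
    step(exec_, "d")
    exec_ = m3                        # end inner IF: restore saved mask
    step(exec_, "e")
    exec_ = ~exec_ & m1               # outer ELSE
    step(exec_, "f")
    exec_ = m1                        # end outer IF: restore saved mask
    step(exec_, "g")
    return trace
```

Lanes where ``c1`` is false execute only ``a``, ``f``, ``g``, exactly as the
linearized control flow dictates.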
An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
update it to enable the necessary lanes, perform the operations, and then
restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit.
The version number conforms to [SEMVER]_.

Call Frame Information
----------------------

DWARF Call Frame Information (CFI) describes how a consumer can virtually
*unwind* call frames in a running process or core dump. See DWARF Version 5
section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.

For AMDGPU, the Common Information Entry (CIE) fields have the following values:

1. ``augmentation`` string contains the following null-terminated UTF-8 string:

   ::

     [amd:v0.0]

   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE or in the FDEs that use it. The version number
   conforms to [SEMVER]_.

2. ``address_size`` for the ``Global`` address space is defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.

4. ``code_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

5. ``data_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
   for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.

7. ``initial_instructions``: Since a subprogram X with fewer registers can be
   called from a subprogram Y that has more allocated, X will not change any of
   the extra registers as it cannot access them. Therefore, the default rule
   for all columns is ``same value``.

For AMDGPU the register number follows the numbering defined in
:ref:`amdgpu-dwarf-register-identifier`.

For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.
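The ``[vendor:vX.Y]`` version scheme used by these augmentation strings can be
illustrated with a small parser sketch. The helper is hypothetical, not an LLVM
API; the CIE string itself is null-terminated, which a real consumer would strip
first.

```python
import re

def parse_amdgpu_augmentation(s: str):
    """Parse an AMDGPU augmentation string such as "[amd:v0.0]" into
    (vendor, major, minor). Illustrative sketch only."""
    m = re.fullmatch(r"\[([a-z]+):v(\d+)\.(\d+)\]", s)
    if m is None:
        raise ValueError(f"not an AMDGPU augmentation string: {s!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))
```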
Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table fields have the following
values:

``augmentation_string_size`` (uword)
  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)
  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This is different from the DWARF Version 5 definition, which requires the
    first 4 characters to be the vendor ID. But this is consistent with the
    other augmentation strings and does allow multiple vendor contributions.
    However, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table fields have the following
values:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

  Should the ``isa`` state machine register be used to indicate if the code is
  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either 32-bit or 64-bit DWARF format. LLVM
  generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
2491 2492Code object metadata is specified in a note record (see 2493:ref:`amdgpu-note-records`) and is required when the target triple OS is 2494``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum 2495information necessary to support the HSA compatible runtime kernel queries. For 2496example, the segment sizes needed in a dispatch packet. In addition, a 2497high-level language runtime may require other information to be included. For 2498example, the AMD OpenCL runtime records kernel argument information. 2499 2500.. _amdgpu-amdhsa-code-object-metadata-v2: 2501 2502Code Object V2 Metadata 2503+++++++++++++++++++++++ 2504 2505.. warning:: 2506 Code object V2 is not the default code object version emitted by this version 2507 of LLVM. 2508 2509Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record 2510(see :ref:`amdgpu-note-records-v2`). 2511 2512The metadata is specified as a YAML formatted string (see [YAML]_ and 2513:doc:`YamlIO`). 2514 2515.. TODO:: 2516 2517 Is the string null terminated? It probably should not if YAML allows it to 2518 contain null characters, otherwise it should be. 2519 2520The metadata is represented as a single YAML document comprised of the mapping 2521defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and 2522referenced tables. 2523 2524For boolean values, the string values of ``false`` and ``true`` are used for 2525false and true respectively. 2526 2527Additional information can be added to the mappings. To avoid conflicts, any 2528non-AMD key names should be prefixed by "*vendor-name*.". 2529 2530 .. table:: AMDHSA Code Object V2 Metadata Map 2531 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table 2532 2533 ========== ============== ========= ======================================= 2534 String Key Value Type Required? 
Description 2535 ========== ============== ========= ======================================= 2536 "Version" sequence of Required - The first integer is the major 2537 2 integers version. Currently 1. 2538 - The second integer is the minor 2539 version. Currently 0. 2540 "Printf" sequence of Each string is encoded information 2541 strings about a printf function call. The 2542 encoded information is organized as 2543 fields separated by colon (':'): 2544 2545 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 2546 2547 where: 2548 2549 ``ID`` 2550 A 32-bit integer as a unique id for 2551 each printf function call 2552 2553 ``N`` 2554 A 32-bit integer equal to the number 2555 of arguments of printf function call 2556 minus 1 2557 2558 ``S[i]`` (where i = 0, 1, ... , N-1) 2559 32-bit integers for the size in bytes 2560 of the i-th FormatString argument of 2561 the printf function call 2562 2563 FormatString 2564 The format string passed to the 2565 printf function call. 2566 "Kernels" sequence of Required Sequence of the mappings for each 2567 mapping kernel in the code object. See 2568 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table` 2569 for the definition of the mapping. 2570 ========== ============== ========= ======================================= 2571 2572.. 2573 2574 .. table:: AMDHSA Code Object V2 Kernel Metadata Map 2575 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table 2576 2577 ================= ============== ========= ================================ 2578 String Key Value Type Required? Description 2579 ================= ============== ========= ================================ 2580 "Name" string Required Source name of the kernel. 2581 "SymbolName" string Required Name of the kernel 2582 descriptor ELF symbol. 2583 "Language" string Source language of the kernel. 
2584 Values include: 2585 2586 - "OpenCL C" 2587 - "OpenCL C++" 2588 - "HCC" 2589 - "OpenMP" 2590 2591 "LanguageVersion" sequence of - The first integer is the major 2592 2 integers version. 2593 - The second integer is the 2594 minor version. 2595 "Attrs" mapping Mapping of kernel attributes. 2596 See 2597 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table` 2598 for the mapping definition. 2599 "Args" sequence of Sequence of mappings of the 2600 mapping kernel arguments. See 2601 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table` 2602 for the definition of the mapping. 2603 "CodeProps" mapping Mapping of properties related to 2604 the kernel code. See 2605 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table` 2606 for the mapping definition. 2607 ================= ============== ========= ================================ 2608 2609.. 2610 2611 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map 2612 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table 2613 2614 =================== ============== ========= ============================== 2615 String Key Value Type Required? Description 2616 =================== ============== ========= ============================== 2617 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values 2618 3 integers must be >=1 and the dispatch 2619 work-group size X, Y, Z must 2620 correspond to the specified 2621 values. Defaults to 0, 0, 0. 2622 2623 Corresponds to the OpenCL 2624 ``reqd_work_group_size`` 2625 attribute. 2626 "WorkGroupSizeHint" sequence of The dispatch work-group size 2627 3 integers X, Y, Z is likely to be the 2628 specified values. 2629 2630 Corresponds to the OpenCL 2631 ``work_group_size_hint`` 2632 attribute. 2633 "VecTypeHint" string The name of a scalar or vector 2634 type. 2635 2636 Corresponds to the OpenCL 2637 ``vec_type_hint`` attribute. 
2638 2639 "RuntimeHandle" string The external symbol name 2640 associated with a kernel. 2641 OpenCL runtime allocates a 2642 global buffer for the symbol 2643 and saves the kernel's address 2644 to it, which is used for 2645 device side enqueueing. Only 2646 available for device side 2647 enqueued kernels. 2648 =================== ============== ========= ============================== 2649 2650.. 2651 2652 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map 2653 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table 2654 2655 ================= ============== ========= ================================ 2656 String Key Value Type Required? Description 2657 ================= ============== ========= ================================ 2658 "Name" string Kernel argument name. 2659 "TypeName" string Kernel argument type name. 2660 "Size" integer Required Kernel argument size in bytes. 2661 "Align" integer Required Kernel argument alignment in 2662 bytes. Must be a power of two. 2663 "ValueKind" string Required Kernel argument kind that 2664 specifies how to set up the 2665 corresponding argument. 2666 Values include: 2667 2668 "ByValue" 2669 The argument is copied 2670 directly into the kernarg. 2671 2672 "GlobalBuffer" 2673 A global address space pointer 2674 to the buffer data is passed 2675 in the kernarg. 2676 2677 "DynamicSharedPointer" 2678 A group address space pointer 2679 to dynamically allocated LDS 2680 is passed in the kernarg. 2681 2682 "Sampler" 2683 A global address space 2684 pointer to a S# is passed in 2685 the kernarg. 2686 2687 "Image" 2688 A global address space 2689 pointer to a T# is passed in 2690 the kernarg. 2691 2692 "Pipe" 2693 A global address space pointer 2694 to an OpenCL pipe is passed in 2695 the kernarg. 2696 2697 "Queue" 2698 A global address space pointer 2699 to an OpenCL device enqueue 2700 queue is passed in the 2701 kernarg. 
2702 2703 "HiddenGlobalOffsetX" 2704 The OpenCL grid dispatch 2705 global offset for the X 2706 dimension is passed in the 2707 kernarg. 2708 2709 "HiddenGlobalOffsetY" 2710 The OpenCL grid dispatch 2711 global offset for the Y 2712 dimension is passed in the 2713 kernarg. 2714 2715 "HiddenGlobalOffsetZ" 2716 The OpenCL grid dispatch 2717 global offset for the Z 2718 dimension is passed in the 2719 kernarg. 2720 2721 "HiddenNone" 2722 An argument that is not used 2723 by the kernel. Space needs to 2724 be left for it, but it does 2725 not need to be set up. 2726 2727 "HiddenPrintfBuffer" 2728 A global address space pointer 2729 to the runtime printf buffer 2730 is passed in kernarg. 2731 2732 "HiddenHostcallBuffer" 2733 A global address space pointer 2734 to the runtime hostcall buffer 2735 is passed in kernarg. 2736 2737 "HiddenDefaultQueue" 2738 A global address space pointer 2739 to the OpenCL device enqueue 2740 queue that should be used by 2741 the kernel by default is 2742 passed in the kernarg. 2743 2744 "HiddenCompletionAction" 2745 A global address space pointer 2746 to help link enqueued kernels into 2747 the ancestor tree for determining 2748 when the parent kernel has finished. 2749 2750 "HiddenMultiGridSyncArg" 2751 A global address space pointer for 2752 multi-grid synchronization is 2753 passed in the kernarg. 2754 2755 "ValueType" string Unused and deprecated. This should no longer 2756 be emitted, but is accepted for compatibility. 2757 2758 2759 "PointeeAlign" integer Alignment in bytes of pointee 2760 type for pointer type kernel 2761 argument. Must be a power 2762 of 2. Only present if 2763 "ValueKind" is 2764 "DynamicSharedPointer". 2765 "AddrSpaceQual" string Kernel argument address space 2766 qualifier. Only present if 2767 "ValueKind" is "GlobalBuffer" or 2768 "DynamicSharedPointer". Values 2769 are: 2770 2771 - "Private" 2772 - "Global" 2773 - "Constant" 2774 - "Local" 2775 - "Generic" 2776 - "Region" 2777 2778 .. 
TODO:: 2779 2780 Is GlobalBuffer only Global 2781 or Constant? Is 2782 DynamicSharedPointer always 2783 Local? Can HCC allow Generic? 2784 How can Private or Region 2785 ever happen? 2786 2787 "AccQual" string Kernel argument access 2788 qualifier. Only present if 2789 "ValueKind" is "Image" or 2790 "Pipe". Values 2791 are: 2792 2793 - "ReadOnly" 2794 - "WriteOnly" 2795 - "ReadWrite" 2796 2797 .. TODO:: 2798 2799 Does this apply to 2800 GlobalBuffer? 2801 2802 "ActualAccQual" string The actual memory accesses 2803 performed by the kernel on the 2804 kernel argument. Only present if 2805 "ValueKind" is "GlobalBuffer", 2806 "Image", or "Pipe". This may be 2807 more restrictive than indicated 2808 by "AccQual" to reflect what the 2809 kernel actual does. If not 2810 present then the runtime must 2811 assume what is implied by 2812 "AccQual" and "IsConst". Values 2813 are: 2814 2815 - "ReadOnly" 2816 - "WriteOnly" 2817 - "ReadWrite" 2818 2819 "IsConst" boolean Indicates if the kernel argument 2820 is const qualified. Only present 2821 if "ValueKind" is 2822 "GlobalBuffer". 2823 2824 "IsRestrict" boolean Indicates if the kernel argument 2825 is restrict qualified. Only 2826 present if "ValueKind" is 2827 "GlobalBuffer". 2828 2829 "IsVolatile" boolean Indicates if the kernel argument 2830 is volatile qualified. Only 2831 present if "ValueKind" is 2832 "GlobalBuffer". 2833 2834 "IsPipe" boolean Indicates if the kernel argument 2835 is pipe qualified. Only present 2836 if "ValueKind" is "Pipe". 2837 2838 .. TODO:: 2839 2840 Can GlobalBuffer be pipe 2841 qualified? 2842 2843 ================= ============== ========= ================================ 2844 2845.. 2846 2847 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map 2848 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table 2849 2850 ============================ ============== ========= ===================== 2851 String Key Value Type Required? 
Description 2852 ============================ ============== ========= ===================== 2853 "KernargSegmentSize" integer Required The size in bytes of 2854 the kernarg segment 2855 that holds the values 2856 of the arguments to 2857 the kernel. 2858 "GroupSegmentFixedSize" integer Required The amount of group 2859 segment memory 2860 required by a 2861 work-group in 2862 bytes. This does not 2863 include any 2864 dynamically allocated 2865 group segment memory 2866 that may be added 2867 when the kernel is 2868 dispatched. 2869 "PrivateSegmentFixedSize" integer Required The amount of fixed 2870 private address space 2871 memory required for a 2872 work-item in 2873 bytes. If the kernel 2874 uses a dynamic call 2875 stack then additional 2876 space must be added 2877 to this value for the 2878 call stack. 2879 "KernargSegmentAlign" integer Required The maximum byte 2880 alignment of 2881 arguments in the 2882 kernarg segment. Must 2883 be a power of 2. 2884 "WavefrontSize" integer Required Wavefront size. Must 2885 be a power of 2. 2886 "NumSGPRs" integer Required Number of scalar 2887 registers used by a 2888 wavefront for 2889 GFX6-GFX10. This 2890 includes the special 2891 SGPRs for VCC, Flat 2892 Scratch (GFX7-GFX10) 2893 and XNACK (for 2894 GFX8-GFX10). It does 2895 not include the 16 2896 SGPR added if a trap 2897 handler is 2898 enabled. It is not 2899 rounded up to the 2900 allocation 2901 granularity. 2902 "NumVGPRs" integer Required Number of vector 2903 registers used by 2904 each work-item for 2905 GFX6-GFX10 2906 "MaxFlatWorkGroupSize" integer Required Maximum flat 2907 work-group size 2908 supported by the 2909 kernel in work-items. 2910 Must be >=1 and 2911 consistent with 2912 ReqdWorkGroupSize if 2913 not 0, 0, 0. 2914 "NumSpilledSGPRs" integer Number of stores from 2915 a scalar register to 2916 a register allocator 2917 created spill 2918 location. 
2919 "NumSpilledVGPRs" integer Number of stores from 2920 a vector register to 2921 a register allocator 2922 created spill 2923 location. 2924 ============================ ============== ========= ===================== 2925 2926.. _amdgpu-amdhsa-code-object-metadata-v3: 2927 2928Code Object V3 Metadata 2929+++++++++++++++++++++++ 2930 2931Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note 2932record (see :ref:`amdgpu-note-records-v3-v4`). 2933 2934The metadata is represented as Message Pack formatted binary data (see 2935[MsgPack]_). The top level is a Message Pack map that includes the 2936keys defined in table 2937:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced 2938tables. 2939 2940Additional information can be added to the maps. To avoid conflicts, 2941any key names should be prefixed by "*vendor-name*." where 2942``vendor-name`` can be the name of the vendor and specific vendor 2943tool that generates the information. The prefix is abbreviated to 2944simply "." when it appears within a map that has been added by the 2945same *vendor-name*. 2946 2947 .. table:: AMDHSA Code Object V3 Metadata Map 2948 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 2949 2950 ================= ============== ========= ======================================= 2951 String Key Value Type Required? Description 2952 ================= ============== ========= ======================================= 2953 "amdhsa.version" sequence of Required - The first integer is the major 2954 2 integers version. Currently 1. 2955 - The second integer is the minor 2956 version. Currently 0. 2957 "amdhsa.printf" sequence of Each string is encoded information 2958 strings about a printf function call. 
The 2959 encoded information is organized as 2960 fields separated by colon (':'): 2961 2962 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 2963 2964 where: 2965 2966 ``ID`` 2967 A 32-bit integer as a unique id for 2968 each printf function call 2969 2970 ``N`` 2971 A 32-bit integer equal to the number 2972 of arguments of printf function call 2973 minus 1 2974 2975 ``S[i]`` (where i = 0, 1, ... , N-1) 2976 32-bit integers for the size in bytes 2977 of the i-th FormatString argument of 2978 the printf function call 2979 2980 FormatString 2981 The format string passed to the 2982 printf function call. 2983 "amdhsa.kernels" sequence of Required Sequence of the maps for each 2984 map kernel in the code object. See 2985 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` 2986 for the definition of the keys included 2987 in that map. 2988 ================= ============== ========= ======================================= 2989 2990.. 2991 2992 .. table:: AMDHSA Code Object V3 Kernel Metadata Map 2993 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 2994 2995 =================================== ============== ========= ================================ 2996 String Key Value Type Required? Description 2997 =================================== ============== ========= ================================ 2998 ".name" string Required Source name of the kernel. 2999 ".symbol" string Required Name of the kernel 3000 descriptor ELF symbol. 3001 ".language" string Source language of the kernel. 3002 Values include: 3003 3004 - "OpenCL C" 3005 - "OpenCL C++" 3006 - "HCC" 3007 - "HIP" 3008 - "OpenMP" 3009 - "Assembler" 3010 3011 ".language_version" sequence of - The first integer is the major 3012 2 integers version. 3013 - The second integer is the 3014 minor version. 3015 ".args" sequence of Sequence of maps of the 3016 map kernel arguments. 
See 3017 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` 3018 for the definition of the keys 3019 included in that map. 3020 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values 3021 3 integers must be >=1 and the dispatch 3022 work-group size X, Y, Z must 3023 correspond to the specified 3024 values. Defaults to 0, 0, 0. 3025 3026 Corresponds to the OpenCL 3027 ``reqd_work_group_size`` 3028 attribute. 3029 ".workgroup_size_hint" sequence of The dispatch work-group size 3030 3 integers X, Y, Z is likely to be the 3031 specified values. 3032 3033 Corresponds to the OpenCL 3034 ``work_group_size_hint`` 3035 attribute. 3036 ".vec_type_hint" string The name of a scalar or vector 3037 type. 3038 3039 Corresponds to the OpenCL 3040 ``vec_type_hint`` attribute. 3041 3042 ".device_enqueue_symbol" string The external symbol name 3043 associated with a kernel. 3044 OpenCL runtime allocates a 3045 global buffer for the symbol 3046 and saves the kernel's address 3047 to it, which is used for 3048 device side enqueueing. Only 3049 available for device side 3050 enqueued kernels. 3051 ".kernarg_segment_size" integer Required The size in bytes of 3052 the kernarg segment 3053 that holds the values 3054 of the arguments to 3055 the kernel. 3056 ".group_segment_fixed_size" integer Required The amount of group 3057 segment memory 3058 required by a 3059 work-group in 3060 bytes. This does not 3061 include any 3062 dynamically allocated 3063 group segment memory 3064 that may be added 3065 when the kernel is 3066 dispatched. 3067 ".private_segment_fixed_size" integer Required The amount of fixed 3068 private address space 3069 memory required for a 3070 work-item in 3071 bytes. If the kernel 3072 uses a dynamic call 3073 stack then additional 3074 space must be added 3075 to this value for the 3076 call stack. 3077 ".kernarg_segment_align" integer Required The maximum byte 3078 alignment of 3079 arguments in the 3080 kernarg segment. 
Must 3081 be a power of 2. 3082 ".wavefront_size" integer Required Wavefront size. Must 3083 be a power of 2. 3084 ".sgpr_count" integer Required Number of scalar 3085 registers required by a 3086 wavefront for 3087 GFX6-GFX9. A register 3088 is required if it is 3089 used explicitly, or 3090 if a higher numbered 3091 register is used 3092 explicitly. This 3093 includes the special 3094 SGPRs for VCC, Flat 3095 Scratch (GFX7-GFX9) 3096 and XNACK (for 3097 GFX8-GFX9). It does 3098 not include the 16 3099 SGPR added if a trap 3100 handler is 3101 enabled. It is not 3102 rounded up to the 3103 allocation 3104 granularity. 3105 ".vgpr_count" integer Required Number of vector 3106 registers required by 3107 each work-item for 3108 GFX6-GFX9. A register 3109 is required if it is 3110 used explicitly, or 3111 if a higher numbered 3112 register is used 3113 explicitly. 3114 ".max_flat_workgroup_size" integer Required Maximum flat 3115 work-group size 3116 supported by the 3117 kernel in work-items. 3118 Must be >=1 and 3119 consistent with 3120 ReqdWorkGroupSize if 3121 not 0, 0, 0. 3122 ".sgpr_spill_count" integer Number of stores from 3123 a scalar register to 3124 a register allocator 3125 created spill 3126 location. 3127 ".vgpr_spill_count" integer Number of stores from 3128 a vector register to 3129 a register allocator 3130 created spill 3131 location. 3132 =================================== ============== ========= ================================ 3133 3134.. 3135 3136 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map 3137 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 3138 3139 ====================== ============== ========= ================================ 3140 String Key Value Type Required? Description 3141 ====================== ============== ========= ================================ 3142 ".name" string Kernel argument name. 3143 ".type_name" string Kernel argument type name. 
3144 ".size" integer Required Kernel argument size in bytes. 3145 ".offset" integer Required Kernel argument offset in 3146 bytes. The offset must be a 3147 multiple of the alignment 3148 required by the argument. 3149 ".value_kind" string Required Kernel argument kind that 3150 specifies how to set up the 3151 corresponding argument. 3152 Values include: 3153 3154 "by_value" 3155 The argument is copied 3156 directly into the kernarg. 3157 3158 "global_buffer" 3159 A global address space pointer 3160 to the buffer data is passed 3161 in the kernarg. 3162 3163 "dynamic_shared_pointer" 3164 A group address space pointer 3165 to dynamically allocated LDS 3166 is passed in the kernarg. 3167 3168 "sampler" 3169 A global address space 3170 pointer to a S# is passed in 3171 the kernarg. 3172 3173 "image" 3174 A global address space 3175 pointer to a T# is passed in 3176 the kernarg. 3177 3178 "pipe" 3179 A global address space pointer 3180 to an OpenCL pipe is passed in 3181 the kernarg. 3182 3183 "queue" 3184 A global address space pointer 3185 to an OpenCL device enqueue 3186 queue is passed in the 3187 kernarg. 3188 3189 "hidden_global_offset_x" 3190 The OpenCL grid dispatch 3191 global offset for the X 3192 dimension is passed in the 3193 kernarg. 3194 3195 "hidden_global_offset_y" 3196 The OpenCL grid dispatch 3197 global offset for the Y 3198 dimension is passed in the 3199 kernarg. 3200 3201 "hidden_global_offset_z" 3202 The OpenCL grid dispatch 3203 global offset for the Z 3204 dimension is passed in the 3205 kernarg. 3206 3207 "hidden_none" 3208 An argument that is not used 3209 by the kernel. Space needs to 3210 be left for it, but it does 3211 not need to be set up. 3212 3213 "hidden_printf_buffer" 3214 A global address space pointer 3215 to the runtime printf buffer 3216 is passed in kernarg. 3217 3218 "hidden_hostcall_buffer" 3219 A global address space pointer 3220 to the runtime hostcall buffer 3221 is passed in kernarg. 
3222 3223 "hidden_default_queue" 3224 A global address space pointer 3225 to the OpenCL device enqueue 3226 queue that should be used by 3227 the kernel by default is 3228 passed in the kernarg. 3229 3230 "hidden_completion_action" 3231 A global address space pointer 3232 to help link enqueued kernels into 3233 the ancestor tree for determining 3234 when the parent kernel has finished. 3235 3236 "hidden_multigrid_sync_arg" 3237 A global address space pointer for 3238 multi-grid synchronization is 3239 passed in the kernarg. 3240 3241 ".value_type" string Unused and deprecated. This should no longer 3242 be emitted, but is accepted for compatibility. 3243 3244 ".pointee_align" integer Alignment in bytes of pointee 3245 type for pointer type kernel 3246 argument. Must be a power 3247 of 2. Only present if 3248 ".value_kind" is 3249 "dynamic_shared_pointer". 3250 ".address_space" string Kernel argument address space 3251 qualifier. Only present if 3252 ".value_kind" is "global_buffer" or 3253 "dynamic_shared_pointer". Values 3254 are: 3255 3256 - "private" 3257 - "global" 3258 - "constant" 3259 - "local" 3260 - "generic" 3261 - "region" 3262 3263 .. TODO:: 3264 3265 Is "global_buffer" only "global" 3266 or "constant"? Is 3267 "dynamic_shared_pointer" always 3268 "local"? Can HCC allow "generic"? 3269 How can "private" or "region" 3270 ever happen? 3271 3272 ".access" string Kernel argument access 3273 qualifier. Only present if 3274 ".value_kind" is "image" or 3275 "pipe". Values 3276 are: 3277 3278 - "read_only" 3279 - "write_only" 3280 - "read_write" 3281 3282 .. TODO:: 3283 3284 Does this apply to 3285 "global_buffer"? 3286 3287 ".actual_access" string The actual memory accesses 3288 performed by the kernel on the 3289 kernel argument. Only present if 3290 ".value_kind" is "global_buffer", 3291 "image", or "pipe". This may be 3292 more restrictive than indicated 3293 by ".access" to reflect what the 3294 kernel actual does. 
If not 3295 present then the runtime must 3296 assume what is implied by 3297 ".access" and ".is_const" . Values 3298 are: 3299 3300 - "read_only" 3301 - "write_only" 3302 - "read_write" 3303 3304 ".is_const" boolean Indicates if the kernel argument 3305 is const qualified. Only present 3306 if ".value_kind" is 3307 "global_buffer". 3308 3309 ".is_restrict" boolean Indicates if the kernel argument 3310 is restrict qualified. Only 3311 present if ".value_kind" is 3312 "global_buffer". 3313 3314 ".is_volatile" boolean Indicates if the kernel argument 3315 is volatile qualified. Only 3316 present if ".value_kind" is 3317 "global_buffer". 3318 3319 ".is_pipe" boolean Indicates if the kernel argument 3320 is pipe qualified. Only present 3321 if ".value_kind" is "pipe". 3322 3323 .. TODO:: 3324 3325 Can "global_buffer" be pipe 3326 qualified? 3327 3328 ====================== ============== ========= ================================ 3329 3330.. _amdgpu-amdhsa-code-object-metadata-v4: 3331 3332Code Object V4 Metadata 3333+++++++++++++++++++++++ 3334 3335.. warning:: 3336 Code object V4 is not the default code object version emitted by this version 3337 of LLVM. 3338 3339Code object V4 metadata is the same as 3340:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions 3341defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3`. 3342 3343 .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3` 3344 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4 3345 3346 ================= ============== ========= ======================================= 3347 String Key Value Type Required? Description 3348 ================= ============== ========= ======================================= 3349 "amdhsa.version" sequence of Required - The first integer is the major 3350 2 integers version. Currently 1. 3351 - The second integer is the minor 3352 version. Currently 1. 
     "amdhsa.target"   string         Required  The target name of the code
                                                using the syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See
                                                :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was loaded
   by an HSA compatible runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. Queries on the kernel symbol
   provided by the HSA compatible runtime can be used to obtain the code object
   values which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine code,
   the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel.
This allows scalar read instructions to be used.
The vector and scalar L1 caches are invalidated of volatile data before each
kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.

The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
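The interleaved scratch mapping described above can be written out as plain
arithmetic. The following is a minimal sketch, not part of any tool: the
function name and the wave64 wavefront size are assumptions, and
``private_address`` is taken in dword units, matching the dword interleaving
of the formula.

```python
WAVEFRONT_SIZE = 64  # wave64; an assumption for this sketch

def scratch_physical_address(wavefront_scratch_base, private_address, lane_id,
                             wavefront_size=WAVEFRONT_SIZE):
    """Apply the mapping from the text: wavefront-scratch-base +
    (private-address * wavefront-size * 4) + (wavefront-lane-id * 4)."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + lane_id * 4)

# Lanes of one wavefront reading the same private address touch adjacent
# dwords, so fewer cache lines need to be fetched:
addresses = [scratch_physical_address(0x1000, 0, lane) for lane in range(4)]
```

With a base of ``0x1000``, lanes 0-3 map to ``0x1000``, ``0x1004``, ``0x1008``
and ``0x100C``, illustrating why same-address access across a wavefront is
cache friendly.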
FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
For GFX9-GFX10 the aperture base addresses are directly available as inline
constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In
64-bit address mode the aperture sizes are 2^32 bytes and the base is aligned to
2^32, which makes it easier to convert from flat to segment or segment to flat.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP such as managing the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need
                                                     to be added to this value
                                                     if the call stack has
                                                     non-inlined function calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       NULL then there are no
                                                       kernel arguments.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is 0 then the kernarg
                                                       memory size is
                                                       unspecified.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is not 0 then the value
                                                       specifies the kernarg
                                                       memory size in bytes. It
                                                       is recommended to provide
                                                       a value as it may be used
                                                       by CP to optimize making
                                                       the kernarg memory
                                                       visible to the kernel
                                                       code.

     127:96  4 bytes                                 Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20                                      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX90A
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register.
                                                     See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the
                                                     value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _BUFFER
     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT
     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _SIZE
     457:455 3 bits                                  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits                                  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
     511:472 5 bytes                                 Reserved, must be 0.
     512     **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each
                                                     work-item; granularity is
                                                     device specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX90A
                                                       - vgprs_used 0..512
                                                       - vgprs_used = align(arch_vgprs, 4)
                                                         + acc_vgprs
                                                       - max(0, ceil(vgprs_used / 8) - 1)
                                                     GFX10 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10-style
                                                     treatment of NaNs (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts execution
                                                       with specified fp16
                                                       overflow mode.

                                                       - If 0, fp16 overflow
                                                         generates +/-INF values.
                                                       - If 1, fp16 overflow that
                                                         is the result of a
                                                         +/-INF input value or
                                                         divide by 0 produces a
                                                         +/-INF, otherwise clamps
                                                         computed overflow to
                                                         +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute work-groups
                                                         in CU wavefront
                                                         execution mode.
                                                       - If 1 execute work-groups
                                                         in WGP wavefront
                                                         execution mode.

                                                     See :ref:`amdgpu-amdhsa-memory-model`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior of the
                                                       s_waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute SIMD
                                                         wavefronts using oldest
                                                         first policy.
                                                       - If 1 execute SIMD
                                                         wavefronts to ensure
                                                         wavefronts will make
                                                         some forward progress.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32      **Total size 4 bytes**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_PRIVATE_SEGMENT          Enable the setup of the
                                                     private segment.

                                                     In addition, enable the
                                                     setup of the SGPR
                                                     wavefront scratch offset
                                                     system register (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number must
                                                     match the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed a
                                                     trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the X
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Y
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions enabled which
                                                     are generated when a memory
                                                     violation has occurred for
                                                     this wavefront from L1 or
                                                     LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6:
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX10:
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX90A
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of a first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10                                      Reserved, must be 0.
             bits
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15                                      Reserved, must be 0.
             bits
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
     31:4    28                                      Reserved, must be 0.
             bits
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
    ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at SGPR0: the first enabled
register is SGPR0, the next enabled register is SGPR1, etc.; disabled registers
do not have an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

 .. table:: SGPR Register Set Up Order
    :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

    ========== ========================== ====== ==============================
    SGPR Order Name                       Number Description
               (kernel descriptor enable  of
               field)                     SGPRs
    ========== ========================== ====== ==============================
    First      Private Segment Buffer     4      See
               (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
               _segment_buffer)
    then       Dispatch Ptr               2      64-bit address of AQL dispatch
               (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
                                                 actually executing.
    then       Queue Ptr                  2      64-bit address of amd_queue_t
               (enable_sgpr_queue_ptr)           object for AQL queue on which
                                                 the dispatch packet was
                                                 queued.
    then       Kernarg Segment Ptr        2      64-bit address of Kernarg
               (enable_sgpr_kernarg              segment. This is directly
               _segment_ptr)                     copied from the
                                                 kernarg_address in the kernel
                                                 dispatch packet.

                                                 Having CP load it once avoids
                                                 loading it at the beginning of
                                                 every wavefront.
    then       Dispatch Id                2      64-bit Dispatch ID of the
               (enable_sgpr_dispatch_id)         dispatch packet being
                                                 executed.
    then       Flat Scratch Init          2      See
               (enable_sgpr_flat                 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
               _scratch_init)
    then       Private Segment Size       1      The 32-bit byte size of a
               (enable_sgpr_private              single work-item's scratch
               _segment_size)                    memory allocation. This is the
                                                 value from the kernel dispatch
                                                 packet Private Segment Byte
                                                 Size rounded up by CP to a
                                                 multiple of DWORD.

                                                 Having CP load it once avoids
                                                 loading it at the beginning of
                                                 every wavefront.

                                                 This is not used for GFX7-GFX8
                                                 since it is the same value as
                                                 the second SGPR of Flat
                                                 Scratch Init. However, it may
                                                 be needed for GFX9-GFX10 which
                                                 changes the meaning of the
                                                 Flat Scratch Init value.
    then       Grid Work-Group Count X    1      32-bit count of the number of
               (enable_sgpr_grid                 work-groups in the X dimension
               _workgroup_count_X)               for the grid being
                                                 executed. Computed from the
                                                 fields in the kernel dispatch
                                                 packet as ((grid_size.x +
                                                 workgroup_size.x - 1) /
                                                 workgroup_size.x).
    then       Grid Work-Group Count Y    1      32-bit count of the number of
               (enable_sgpr_grid                 work-groups in the Y dimension
               _workgroup_count_Y &&             for the grid being
               less than 16 previous             executed. Computed from the
               SGPRs)                            fields in the kernel dispatch
                                                 packet as ((grid_size.y +
                                                 workgroup_size.y - 1) /
                                                 workgroup_size.y).

                                                 Only initialized if <16
                                                 previous SGPRs initialized.
    then       Grid Work-Group Count Z    1      32-bit count of the number of
               (enable_sgpr_grid                 work-groups in the Z dimension
               _workgroup_count_Z &&             for the grid being
               less than 16 previous             executed. Computed from the
               SGPRs)                            fields in the kernel dispatch
                                                 packet as ((grid_size.z +
                                                 workgroup_size.z - 1) /
                                                 workgroup_size.z).

                                                 Only initialized if <16
                                                 previous SGPRs initialized.
    then       Work-Group Id X            1      32-bit work-group id in X
               (enable_sgpr_workgroup_id         dimension of grid for
               _X)                               wavefront.
    then       Work-Group Id Y            1      32-bit work-group id in Y
               (enable_sgpr_workgroup_id         dimension of grid for
               _Y)                               wavefront.
    then       Work-Group Id Z            1      32-bit work-group id in Z
               (enable_sgpr_workgroup_id         dimension of grid for
               _Z)                               wavefront.
    then       Work-Group Info            1      {first_wavefront, 14'b0000,
               (enable_sgpr_workgroup            ordered_append_term[10:0],
               _info)                            threadgroup_size_in_wavefronts[5:0]}
    then       Scratch Wavefront Offset   1      See
               (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`
               _segment_wavefront_offset)        and
                                                 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
    ========== ========================== ====== ==============================

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1, etc.; disabled registers do not have
a VGPR number.
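The grid work-group counts in the table above are just round-up (ceiling)
divisions of the grid size by the work-group size. A minimal illustrative
sketch (plain Python, not part of the backend; the function name is ours):

```python
def grid_workgroup_count(grid_size: int, workgroup_size: int) -> int:
    # CP computes this per dimension from the kernel dispatch packet fields:
    # ((grid_size + workgroup_size - 1) / workgroup_size) in integer math.
    return (grid_size + workgroup_size - 1) // workgroup_size

# A grid of 1000 work-items with 256-work-item work-groups needs 4
# work-groups; the last one is only partially filled.
print(grid_workgroup_count(1000, 256))  # -> 4
```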
There are different methods used for the VGPR initial state:

* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies otherwise, a separate VGPR register is used per work-item ID. The
  VGPR register initial state for this method is defined in
  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Packed work-item IDs*, the initial value of the VGPR0 register is
  used for all work-item IDs. The register layout for this method is defined in
  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.

 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
    :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table

    ========== ========================== ====== ==============================
    VGPR Order Name                       Number Description
               (kernel descriptor enable  of
               field)                     VGPRs
    ========== ========================== ====== ==============================
    First      Work-Item Id X             1      32-bit work-item id in X
               (Always initialized)              dimension of work-group for
                                                 wavefront lane.
    then       Work-Item Id Y             1      32-bit work-item id in Y
               (enable_vgpr_workitem_id          dimension of work-group for
               > 0)                              wavefront lane.
    then       Work-Item Id Z             1      32-bit work-item id in Z
               (enable_vgpr_workitem_id          dimension of work-group for
               > 1)                              wavefront lane.
    ========== ========================== ====== ==============================

..
 .. table:: Register Layout for Packed Work-Item ID Method
    :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table

    ======= ======= ================ =========================================
    Bits    Size    Field Name       Description
    ======= ======= ================ =========================================
    0:9     10 bits Work-Item Id X   Work-item id in X dimension of
                                     work-group for wavefront lane.

                                     Always initialized.

    10:19   10 bits Work-Item Id Y   Work-item id in Y dimension of
                                     work-group for wavefront lane.

                                     Initialized if enable_vgpr_workitem_id >
                                     0, otherwise set to 0.
    20:29   10 bits Work-Item Id Z   Work-item id in Z dimension of
                                     work-group for wavefront lane.

                                     Initialized if enable_vgpr_workitem_id >
                                     1, otherwise set to 0.
    30:31   2 bits                   Reserved, set to 0.
    ======= ======= ================ =========================================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is
   why its value cannot be included with the flat scratch init value, which is
   per queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
5. Flat Scratch register pair initialization is described in
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

The global segment can be accessed either using buffer instructions (GFX6,
which has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
instructions (GFX9-GFX10).
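The packed work-item ID layout described above can be decoded with simple
shifts and masks. An illustrative sketch (plain Python; the helper names are
ours, and real code would do this with VALU instructions on VGPR0):

```python
def unpack_work_item_ids(vgpr0: int) -> tuple[int, int, int]:
    # Bits 0:9 hold X, bits 10:19 hold Y, bits 20:29 hold Z (10 bits each);
    # bits 30:31 are reserved and set to 0.
    return (vgpr0 & 0x3FF, (vgpr0 >> 10) & 0x3FF, (vgpr0 >> 20) & 0x3FF)

def pack_work_item_ids(x: int, y: int, z: int) -> int:
    # Inverse helper showing how the three IDs share the one 32-bit VGPR.
    assert max(x, y, z) < (1 << 10)
    return x | (y << 10) | (z << 20)

print(unpack_work_item_ids(pack_work_item_ids(5, 3, 1)))  # -> (5, 3, 1)
```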
If buffer operations are used, then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC
  for APU and NC for dGPU).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

The compiler performs initialization in the kernel prologue depending on the
target and information about things like stack usage in the kernel and called
functions. Some of this initialization requires the compiler to request certain
User and System SGPRs be present in the
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
:ref:`amdgpu-amdhsa-kernel-descriptor`.

.. _amdgpu-amdhsa-kernel-prolog-cfi:

CFI
+++

1. The CFI return address is undefined.

2. The CFI CFA is defined using an expression which evaluates to a location
   description that comprises one memory location description for the
   ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.

.. _amdgpu-amdhsa-kernel-prolog-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible LDS value for the given target (0x7FFF for GFX6 and 0xFFFF
  for GFX7-GFX8).
GFX9-GFX10
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.

.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:

Stack Pointer
+++++++++++++

If the kernel has function calls, it must set up the ABI stack pointer
described in
:ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.

.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:

Frame Pointer
+++++++++++++

If the kernel needs a frame pointer for the reasons defined in
``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
kernel prolog. If a frame pointer is not required then all uses of the frame
pointer are replaced with immediate ``0`` offsets.

.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:

Flat Scratch
++++++++++++

There are different methods used for initializing flat scratch:

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Does not support generic address space*:

  Flat scratch is not supported and there is no flat scratch register pair.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Offset flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
  Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address.

     CP obtains this from the runtime.
     (The Scratch Segment Buffer base address is
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)

     The prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.

     The Scratch Wavefront Offset must also be used as an offset with Private
     segment address when using the Scratch Segment Buffer.

     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.

     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8
     (where SGPRn is the highest numbered SGPR allocated to the wavefront).
     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per-wavefront
     FLAT SCRATCH BASE in flat memory instructions that access the scratch
     aperture.

  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.

     CP obtains this from the runtime, and it is always a multiple of DWORD.
     CP checks that the value in the kernel dispatch packet Private Segment
     Byte Size is not larger and requests the runtime to increase the queue's
     scratch size if necessary.

     CP directly loads from the kernel dispatch packet Private Segment Byte
     Size field and rounds up to a multiple of DWORD. Having CP load it once
     avoids loading it at the beginning of every wavefront.

     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3
     on GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH
     SIZE in flat memory instructions.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Absolute flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3).
  Initialization uses Flat Scratch Init and Scratch Wavefront Offset SGPR
  registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.

  CP obtains this from the runtime.

  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
  memory instructions.

  The Scratch Wavefront Offset must also be used as an offset with Private
  segment address when using the Scratch Segment Buffer (see
  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).

.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:

Private Segment Buffer
++++++++++++++++++++++

The Private Segment Buffer SGPR register is used to initialize 4 SGPRs
that are used as a V# to access scratch. CP uses the value provided by the
runtime. It is used, together with Scratch Wavefront Offset as an offset, to
access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:

  - If it is known during instruction selection that there is stack usage,
    SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
    optimizations are disabled (``-O0``), if stack objects already exist (for
    locals, etc.), or if there are any function calls.

  - Otherwise, four high-numbered SGPRs beginning at a four-aligned SGPR index
    are reserved for the tentative scratch V#. These will be used if it is
    determined that spilling is needed.

    - If no use is made of the tentative scratch V#, then it is unreserved,
      and the register count is determined ignoring it.
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.

  .. note::

    This approach of using a tentative scratch V# and shifting the register
    numbers if used avoids having to perform register allocation a second
    time if the tentative V# is eliminated. This is more efficient and
    avoids the problem that the second register allocation may perform
    spilling, which will fail as there is no longer a scratch V#.

When the kernel prolog code is being emitted, it is known whether the scratch
V# described above is actually used. If it is, the prolog code must set it up
by copying the Private Segment Buffer to the scratch V# registers and then
adding the Private Segment Wavefront Offset to the queue base address in the
V#. The result is a V# with a base address pointing to the beginning of the
wavefront scratch backing memory.
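The prolog address arithmetic described above amounts to a couple of adds and
a shift. An illustrative sketch (plain Python; the names are ours, and real
prolog code performs these operations on SGPRs):

```python
def scratch_vsharp_base(queue_scratch_base: int, wavefront_offset: int) -> int:
    # Scratch V# setup: add the Private Segment Wavefront Offset to the queue
    # base address copied from the Private Segment Buffer, giving a base that
    # points at this wavefront's slice of the scratch backing memory.
    return queue_scratch_base + wavefront_offset

def flat_scratch_hi_gfx7_gfx8(flat_scratch_init_lo: int,
                              wavefront_offset: int) -> int:
    # GFX7-GFX8 offset flat scratch: the wavefront's byte offset from
    # SH_HIDDEN_PRIVATE_BASE_VIMID, right shifted by 8 since the FLAT SCRATCH
    # BASE field is held in units of 256 bytes.
    return (flat_scratch_init_lo + wavefront_offset) >> 8

print(hex(scratch_vsharp_base(0x100000, 0x2000)))  # -> 0x102000
```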
The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute. The ``s_waitcnt`` and cache
management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to
be moved earlier or later, which can allow them to be combined with other
instances of the same instruction, or hoisted/sunk out of loops to improve
performance. Only the instructions related to the memory model are given;
additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions, join the relations. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both local and
    global address space was specified.
    However, optimizations can often be done to eliminate the additional
    ``s_waitcnt`` instructions when there are no intervening memory
    instructions which access the corresponding address space. The code
    sequences in the table indicate what can be omitted for the OpenCL memory
    model. The target triple environment is used to determine if the source
    language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.

The private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single
thread is accessing the memory, atomic memory orderings are not meaningful,
and all accesses are treated as non-atomic.

The constant address space uses ``buffer/global_load`` instructions (or
equivalent scalar memory instructions). Since the constant address space
contents do not change during the execution of a kernel dispatch, it is not
legal to perform stores; atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space, which is treated
as non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.
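The ordering downgrades described in the preceding paragraphs can be
summarized as a small lookup. An illustrative sketch (plain Python, not an
LLVM API; ``"non-atomic"`` stands for the treated-as-non-atomic cases):

```python
def effective_ordering(op: str, ordering: str) -> str:
    # Map an LLVM memory ordering to the ordering the AMDGPU memory model
    # actually honors for the given operation kind, per the rules above.
    if op == "load":
        return {"release": "non-atomic", "acq_rel": "acquire"}.get(ordering, ordering)
    if op == "store":
        return {"acquire": "non-atomic", "acq_rel": "release"}.get(ordering, ordering)
    return ordering  # atomicrmw and fence keep their ordering

print(effective_ordering("store", "acquire"))  # -> non-atomic
print(effective_ordering("load", "acq_rel"))   # -> acquire
```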
4711 4712The memory order also adds the single thread optimization constraints defined in 4713table 4714:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`. 4715 4716 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints 4717 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table 4718 4719 ============ ============================================================== 4720 LLVM Memory Optimization Constraints 4721 Ordering 4722 ============ ============================================================== 4723 unordered *none* 4724 monotonic *none* 4725 acquire - If a load atomic/atomicrmw then no following load/load 4726 atomic/store/store atomic/atomicrmw/fence instruction can be 4727 moved before the acquire. 4728 - If a fence then same as load atomic, plus no preceding 4729 associated fence-paired-atomic can be moved after the fence. 4730 release - If a store atomic/atomicrmw then no preceding load/load 4731 atomic/store/store atomic/atomicrmw/fence instruction can be 4732 moved after the release. 4733 - If a fence then same as store atomic, plus no following 4734 associated fence-paired-atomic can be moved before the 4735 fence. 4736 acq_rel Same constraints as both acquire and release. 4737 seq_cst - If a load atomic then same constraints as acquire, plus no 4738 preceding sequentially consistent load atomic/store 4739 atomic/atomicrmw/fence instruction can be moved after the 4740 seq_cst. 4741 - If a store atomic then the same constraints as release, plus 4742 no following sequentially consistent load atomic/store 4743 atomic/atomicrmw/fence instruction can be moved before the 4744 seq_cst. 4745 - If an atomicrmw/fence then same constraints as acq_rel. 
4746 ============ ============================================================== 4747 4748The code sequences used to implement the memory model are defined in the 4749following sections: 4750 4751* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9` 4752* :ref:`amdgpu-amdhsa-memory-model-gfx90a` 4753* :ref:`amdgpu-amdhsa-memory-model-gfx10` 4754 4755.. _amdgpu-amdhsa-memory-model-gfx6-gfx9: 4756 4757Memory Model GFX6-GFX9 4758++++++++++++++++++++++ 4759 4760For GFX6-GFX9: 4761 4762* Each agent has multiple shader arrays (SA). 4763* Each SA has multiple compute units (CU). 4764* Each CU has multiple SIMDs that execute wavefronts. 4765* The wavefronts for a single work-group are executed in the same CU but may be 4766 executed by different SIMDs. 4767* Each CU has a single LDS memory shared by the wavefronts of the work-groups 4768 executing on it. 4769* All LDS operations of a CU are performed as wavefront wide operations in a 4770 global order and involve no caching. Completion is reported to a wavefront in 4771 execution order. 4772* The LDS memory has multiple request queues shared by the SIMDs of a 4773 CU. Therefore, the LDS operations performed by different wavefronts of a 4774 work-group can be reordered relative to each other, which can result in 4775 reordering the visibility of vector memory operations with respect to LDS 4776 operations of other wavefronts in the same work-group. A ``s_waitcnt 4777 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 4778 vector memory operations between wavefronts of a work-group, but not between 4779 operations performed by the same wavefront. 4780* The vector memory operations are performed as wavefront wide operations and 4781 completion is reported to a wavefront in execution order. The exception is 4782 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of 4783 vector memory order if they access LDS memory, and out of LDS operation order 4784 if they access global memory. 
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence
  between the lanes of a single wavefront, or for coherence between wavefronts
  in the same work-group. A ``buffer_wbinvl1_vol`` is required for coherence
  between wavefronts executing in different work-groups as they may be
  executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so do not
  impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or
  ranges of virtual addresses can be set up to bypass it to ensure system
  coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache
to ensure it is coherent with the vector caches. The scalar and vector caches
are invalidated between kernel dispatches by CP since constant address space
data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In
this case the AMDGPU backend ensures the memory location used to spill is
never accessed by vector memory operations at the same time. If scalar writes
are used then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before
a function return since the locations may be used for vector memory
instructions by a future wavefront that uses the same scratch area, or a
function call that creates a frame at the same address, respectively. There is
no need for a ``s_dcache_inv`` as all scalar writes are write-before-read in
the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate
the volatile cache lines.
4843 4844The code sequences used to implement the memory model for GFX6-GFX9 are defined 4845in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. 4846 4847 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 4848 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table 4849 4850 ============ ============ ============== ========== ================================ 4851 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 4852 Ordering Sync Scope Address GFX6-GFX9 4853 Space 4854 ============ ============ ============== ========== ================================ 4855 **Non-Atomic** 4856 ------------------------------------------------------------------------------------ 4857 load *none* *none* - global - !volatile & !nontemporal 4858 - generic 4859 - private 1. buffer/global/flat_load 4860 - constant 4861 - !volatile & nontemporal 4862 4863 1. buffer/global/flat_load 4864 glc=1 slc=1 4865 4866 - volatile 4867 4868 1. buffer/global/flat_load 4869 glc=1 4870 2. s_waitcnt vmcnt(0) 4871 4872 - Must happen before 4873 any following volatile 4874 global/generic 4875 load/store. 4876 - Ensures that 4877 volatile 4878 operations to 4879 different 4880 addresses will not 4881 be reordered by 4882 hardware. 4883 4884 load *none* *none* - local 1. ds_load 4885 store *none* *none* - global - !volatile & !nontemporal 4886 - generic 4887 - private 1. buffer/global/flat_store 4888 - constant 4889 - !volatile & nontemporal 4890 4891 1. buffer/global/flat_store 4892 glc=1 slc=1 4893 4894 - volatile 4895 4896 1. buffer/global/flat_store 4897 2. s_waitcnt vmcnt(0) 4898 4899 - Must happen before 4900 any following volatile 4901 global/generic 4902 load/store. 4903 - Ensures that 4904 volatile 4905 operations to 4906 different 4907 addresses will not 4908 be reordered by 4909 hardware. 4910 4911 store *none* *none* - local 1. 
ds_store 4912 **Unordered Atomic** 4913 ------------------------------------------------------------------------------------ 4914 load atomic unordered *any* *any* *Same as non-atomic*. 4915 store atomic unordered *any* *any* *Same as non-atomic*. 4916 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 4917 **Monotonic Atomic** 4918 ------------------------------------------------------------------------------------ 4919 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load 4920 - wavefront - local 4921 - workgroup - generic 4922 load atomic monotonic - agent - global 1. buffer/global/flat_load 4923 - system - generic glc=1 4924 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 4925 - wavefront - generic 4926 - workgroup 4927 - agent 4928 - system 4929 store atomic monotonic - singlethread - local 1. ds_store 4930 - wavefront 4931 - workgroup 4932 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 4933 - wavefront - generic 4934 - workgroup 4935 - agent 4936 - system 4937 atomicrmw monotonic - singlethread - local 1. ds_atomic 4938 - wavefront 4939 - workgroup 4940 **Acquire Atomic** 4941 ------------------------------------------------------------------------------------ 4942 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 4943 - wavefront - local 4944 - generic 4945 load atomic acquire - workgroup - global 1. buffer/global_load 4946 load atomic acquire - workgroup - local 1. ds/flat_load 4947 - generic 2. s_waitcnt lgkmcnt(0) 4948 4949 - If OpenCL, omit. 4950 - Must happen before 4951 any following 4952 global/generic 4953 load/load 4954 atomic/store/store 4955 atomic/atomicrmw. 4956 - Ensures any 4957 following global 4958 data read is no 4959 older than a local load 4960 atomic value being 4961 acquired. 4962 4963 load atomic acquire - agent - global 1. buffer/global_load 4964 - system glc=1 4965 2. 
s_waitcnt vmcnt(0) 4966 4967 - Must happen before 4968 following 4969 buffer_wbinvl1_vol. 4970 - Ensures the load 4971 has completed 4972 before invalidating 4973 the cache. 4974 4975 3. buffer_wbinvl1_vol 4976 4977 - Must happen before 4978 any following 4979 global/generic 4980 load/load 4981 atomic/atomicrmw. 4982 - Ensures that 4983 following 4984 loads will not see 4985 stale global data. 4986 4987 load atomic acquire - agent - generic 1. flat_load glc=1 4988 - system 2. s_waitcnt vmcnt(0) & 4989 lgkmcnt(0) 4990 4991 - If OpenCL omit 4992 lgkmcnt(0). 4993 - Must happen before 4994 following 4995 buffer_wbinvl1_vol. 4996 - Ensures the flat_load 4997 has completed 4998 before invalidating 4999 the cache. 5000 5001 3. buffer_wbinvl1_vol 5002 5003 - Must happen before 5004 any following 5005 global/generic 5006 load/load 5007 atomic/atomicrmw. 5008 - Ensures that 5009 following loads 5010 will not see stale 5011 global data. 5012 5013 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 5014 - wavefront - local 5015 - generic 5016 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 5017 atomicrmw acquire - workgroup - local 1. ds/flat_atomic 5018 - generic 2. s_waitcnt lgkmcnt(0) 5019 5020 - If OpenCL, omit. 5021 - Must happen before 5022 any following 5023 global/generic 5024 load/load 5025 atomic/store/store 5026 atomic/atomicrmw. 5027 - Ensures any 5028 following global 5029 data read is no 5030 older than a local 5031 atomicrmw value 5032 being acquired. 5033 5034 atomicrmw acquire - agent - global 1. buffer/global_atomic 5035 - system 2. s_waitcnt vmcnt(0) 5036 5037 - Must happen before 5038 following 5039 buffer_wbinvl1_vol. 5040 - Ensures the 5041 atomicrmw has 5042 completed before 5043 invalidating the 5044 cache. 5045 5046 3. buffer_wbinvl1_vol 5047 5048 - Must happen before 5049 any following 5050 global/generic 5051 load/load 5052 atomic/atomicrmw. 
                                                            - Ensures that following
                                                              loads will not see stale
                                                              global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                   2. s_waitcnt vmcnt(0) &
                                                             lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw has
                                                              completed before
                                                              invalidating the cache.

                                                          3. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see stale
                                                              global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit.
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, need to
                                                              conservatively always
                                                              generate. If fence had an
                                                              address space then set to
                                                              address space of OpenCL
                                                              fence flag, or to generic
                                                              if both local and global
                                                              flags are specified.
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load atomic/atomicrmw with
                                                              an equal or wider sync
                                                              scope and memory ordering
                                                              stronger than unordered
                                                              (this is termed the
                                                              fence-paired-atomic).
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than
                                                              the value read by the
                                                              fence-paired-atomic.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit
                                                              lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, need to
                                                              conservatively always
                                                              generate (see comment for
                                                              previous fence).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic load
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the
                                                              fence-paired atomic has
                                                              completed before
                                                              invalidating the cache.
                                                              Therefore any following
                                                              locations read must be no
                                                              older than the value read
                                                              by the fence-paired-atomic.

                                                          2. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see stale
                                                              global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
                               - wavefront    - local
                                              - generic
     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
                                              - generic
                                                            - If OpenCL, omit.
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following store.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              the store that is being
                                                              released.

                                                          2. buffer/global/flat_store
     store atomic release      - workgroup    - local    1. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit
                                                              lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following store.
                                                            - Ensures that all memory
                                                              operations to memory have
                                                              completed before performing
                                                              the store that is being
                                                              released.

                                                          2. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
                                              - generic
                                                            - If OpenCL, omit.
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                          2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    1. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global and
                                                              local have completed before
                                                              performing the atomicrmw
                                                              that is being released.

                                                          2. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit.
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, need to
                                                              conservatively always
                                                              generate. If fence had an
                                                              address space then set to
                                                              address space of OpenCL
                                                              fence flag, or to generic
                                                              if both local and global
                                                              flags are specified.
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before any
                                                              following store
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              the following
                                                              fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and address space
                                                              is local, omit vmcnt(0).
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, need to
                                                              conservatively always
                                                              generate. If fence had an
                                                              address space then set to
                                                              address space of OpenCL
                                                              fence flag, or to generic
                                                              if both local and global
                                                              flags are specified.
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before any
                                                              following store
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all memory
                                                              operations have completed
                                                              before performing the
                                                              following
                                                              fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                          2. buffer/global_atomic

     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
                                                          2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than
                                                              the local load atomic value
                                                              being acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                          2. flat_atomic
                                                          3. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than a
                                                              local load atomic value
                                                              being acquired.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                          2. buffer/global_atomic
                                                          3. s_waitcnt vmcnt(0)

                                                            - Must happen before following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw has
                                                              completed before
                                                              invalidating the cache.

                                                          4. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see stale
                                                              global data.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                          2. flat_atomic
                                                          3. s_waitcnt vmcnt(0) &
                                                             lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw has
                                                              completed before
                                                              invalidating the cache.

                                                          4. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see stale
                                                              global data.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit.
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, need to
                                                              conservatively always
                                                              generate (see comment for
                                                              previous fence).
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              any following global memory
                                                              operations.
                                                            - Ensures that the preceding
                                                              local/generic load
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed before
                                                              following global memory
                                                              operations. This satisfies
                                                              the requirements of acquire.
                                                            - Ensures that all previous
                                                              memory operations have
                                                              completed before a following
                                                              local/generic store
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of release.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit
                                                              lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, need to
                                                              conservatively always
                                                              generate (see comment for
                                                              previous fence).
                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the preceding
                                                              global/local/generic load
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed before
                                                              invalidating the cache. This
                                                              satisfies the requirements
                                                              of acquire.
                                                            - Ensures that all previous
                                                              memory operations have
                                                              completed before a following
                                                              global/local/generic store
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of release.

                                                          2. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see stale
                                                              global data. This satisfies
                                                              the requirements of acquire.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
                                              - generic
                                                            - Must happen after
                                                              preceding local/generic
                                                              load atomic/store
                                                              atomic/atomicrmw with
                                                              memory ordering of seq_cst
                                                              and with equal or wider
                                                              sync scope. (Note that
                                                              seq_cst fences have their
                                                              own s_waitcnt lgkmcnt(0)
                                                              and so do not need to be
                                                              considered.)
                                                            - Ensures any preceding
                                                              sequential consistent local
                                                              memory instructions have
                                                              completed before executing
                                                              this sequentially
                                                              consistent instruction.
                                                              This prevents reordering a
                                                              seq_cst store followed by a
                                                              seq_cst load. (Note that
                                                              seq_cst is stronger than
                                                              acquire/release as the
                                                              reordering of load acquire
                                                              followed by a store release
                                                              is prevented by the
                                                              s_waitcnt of the release,
                                                              but there is nothing
                                                              preventing a store release
                                                              followed by load acquire
                                                              from completing out of
                                                              order. The s_waitcnt could
                                                              be placed after seq_store
                                                              or before the seq_load. We
                                                              choose the load to make the
                                                              s_waitcnt be as late as
                                                              possible so that the store
                                                              may have already
                                                              completed.)

                                                          2. *Following instructions same
                                                             as corresponding load atomic
                                                             acquire, except must
                                                             generate all instructions
                                                             even for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                            - Could be split into separate
                                                              s_waitcnt vmcnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw with
                                                              memory ordering of seq_cst
                                                              and with equal or wider
                                                              sync scope. (Note that
                                                              seq_cst fences have their
                                                              own s_waitcnt lgkmcnt(0)
                                                              and so do not need to be
                                                              considered.)
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw with
                                                              memory ordering of seq_cst
                                                              and with equal or wider
                                                              sync scope. (Note that
                                                              seq_cst fences have their
                                                              own s_waitcnt vmcnt(0) and
                                                              so do not need to be
                                                              considered.)
                                                            - Ensures any preceding
                                                              sequential consistent
                                                              global memory instructions
                                                              have completed before
                                                              executing this sequentially
                                                              consistent instruction.
                                                              This prevents reordering a
                                                              seq_cst store followed by a
                                                              seq_cst load. (Note that
                                                              seq_cst is stronger than
                                                              acquire/release as the
                                                              reordering of load acquire
                                                              followed by a store release
                                                              is prevented by the
                                                              s_waitcnt of the release,
                                                              but there is nothing
                                                              preventing a store release
                                                              followed by load acquire
                                                              from completing out of
                                                              order. The s_waitcnt could
                                                              be placed after seq_store
                                                              or before the seq_load. We
                                                              choose the load to make the
                                                              s_waitcnt be as late as
                                                              possible so that the store
                                                              may have already
                                                              completed.)

                                                          2. *Following instructions same
                                                             as corresponding load atomic
                                                             acquire, except must
                                                             generate all instructions
                                                             even for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx90a:

Memory Model GFX90A
+++++++++++++++++++

For GFX90A:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in
  which the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated, as wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they
  access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.

  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is tgsplit
    execution mode, as wavefronts of the same work-group can be in different
    CUs, and so a ``buffer_wbinvl1_vol`` is required as described in the
    following item.

  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.

* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so do not
  impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.

  * The L2 cache has independent channels to service disjoint ranges of
    virtual addresses.
  * Each CU has a separate request queue per channel. Therefore, the vector
    and scalar memory operations performed by wavefronts executing in
    different work-groups (which may be executing on different CUs), or the
    same work-group if executing in tgsplit mode, of an agent can be reordered
    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
    synchronization between vector memory operations of different CUs. It
    ensures a previous vector memory operation has completed before executing
    a subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by using
    the MTYPE RW (read-write) for memory local to the L2, and MTYPE NC
    (non-coherent) with the PTE C-bit set for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
      the cache probe caused by the PTE C-bit.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or
      writeback the CPU cache due to the L2 probe filter.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
    * To ensure coherence of local memory writes of work-groups in different
      agents, a ``buffer_wbl2`` is required. It will writeback dirty L2 cache
      lines.
    * To ensure coherence of local memory reads of work-groups in different
      agents, a ``buffer_invl2`` is required. It will invalidate non-local L2
      cache lines.

  * PCIe access from the GPU to the CPU memory can be kept coherent by using
    the MTYPE UC (uncached) which bypasses the L2.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache
to ensure it is coherent with the vector caches. The scalar and vector caches
are invalidated between kernel dispatches by CP since constant address space
data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are
used then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe, the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU, the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
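
As a concrete illustration of how these rules map onto generated code, the
sketch below pairs an agent scope acquire load from the global address space
with the GFX90A instruction sequence given for it in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`. The IR values
and the register operands in the comments are illustrative only:

.. code-block:: llvm

   ; Illustrative only: an agent scope acquire load from the global address
   ; space. Per the GFX90A code sequence table, this lowers to:
   ;
   ;   global_load_dword v2, v[0:1], off glc   ; 1. buffer/global_load glc=1
   ;   s_waitcnt vmcnt(0)                      ; 2. wait for the load to complete
   ;   buffer_wbinvl1_vol                      ; 3. invalidate the vector L1 cache
   %val = load atomic i32, ptr addrspace(1) %p syncscope("agent") acquire, align 4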

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate
the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX90A
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX90A
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1 scc=1
                                                           2. s_waitcnt vmcnt(0)

                                                             - Must happen before any
                                                               following volatile
                                                               global/generic load/store.
                                                             - Ensures that volatile
                                                               operations to different
                                                               addresses will not be
                                                               reordered by hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                              scc=1
                                                           2. s_waitcnt vmcnt(0)

                                                             - Must happen before any
                                                               following volatile
                                                               global/generic load/store.
                                                             - Ensures that volatile
                                                               operations to different
                                                               addresses will not be
                                                               reordered by hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     glc=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     glc=1 scc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic     scc=1
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic     scc=1
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

                                                          2. s_waitcnt vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol.

                                                          3. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following
                                                              loads will not see stale
                                                              data.

     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than
                                                              the local load atomic value
                                                              being acquired.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

                                                          2. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode and
                                                              vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol and any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than a local load
                                                            atomic value being
                                                            acquired.

                                                         3. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     load atomic  acquire      - agent        - global   1. buffer/global_load glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the load has
                                                            completed before
                                                            invalidating the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     load atomic  acquire      - system       - global   1. buffer/global/flat_load
                                                            glc=1 scc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following buffer_invl2 and
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the load has
                                                            completed before
                                                            invalidating the cache.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            MTYPE NC global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale due to the
                                                            memory probes.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the flat_load has
                                                            completed before
                                                            invalidating the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.
     load atomic  acquire      - system       - generic  1. flat_load glc=1 scc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following buffer_invl2 and
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the flat_load has
                                                            completed before
                                                            invalidating the caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            MTYPE NC global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale due to the
                                                            memory probes.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before the
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw has
                                                            completed before
                                                            invalidating the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before any
                                                            following global/generic
                                                            load/load atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than the local
                                                            atomicrmw value being
                                                            acquired.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit lgkmcnt(0).
                                                          - Must happen before the
                                                            following
                                                            buffer_wbinvl1_vol and any
                                                            following global/generic
                                                            load/load atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than a local
                                                            atomicrmw value being
                                                            acquired.

                                                         3. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw has
                                                            completed before
                                                            invalidating the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
                                                            scc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following buffer_invl2 and
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw has
                                                            completed before
                                                            invalidating the caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            MTYPE NC global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale due to the
                                                            memory probes.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw has
                                                            completed before
                                                            invalidating the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     atomicrmw    acquire      - system       - generic  1. flat_atomic scc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following buffer_invl2 and
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw has
                                                            completed before
                                                            invalidating the caches.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            MTYPE NC global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale due to the
                                                            memory probes.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence need to
                                                            conservatively always
                                                            generate. If fence had an
                                                            address space then set to
                                                            address space of OpenCL
                                                            fence flag, or to generic
                                                            if both local and global
                                                            flags are specified.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic load
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - Must happen before the
                                                            following
                                                            buffer_wbinvl1_vol and any
                                                            following global/generic
                                                            load/load atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than the value read
                                                            by the
                                                            fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence need to
                                                            conservatively always
                                                            generate (see comment for
                                                            previous fence).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic load
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - Must happen before the
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures that the
                                                            fence-paired atomic has
                                                            completed before
                                                            invalidating the cache.
                                                            Therefore any following
                                                            locations read must be no
                                                            older than the value read
                                                            by the
                                                            fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence need to
                                                            conservatively always
                                                            generate (see comment for
                                                            previous fence).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic load
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - Must happen before the
                                                            following buffer_invl2 and
                                                            buffer_wbinvl1_vol.
                                                          - Ensures that the
                                                            fence-paired atomic has
                                                            completed before
                                                            invalidating the cache.
                                                            Therefore any following
                                                            locations read must be no
                                                            older than the value read
                                                            by the
                                                            fence-paired-atomic.

                                                         2. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit lgkmcnt(0).
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following store.
                                                          - Ensures that all memory
                                                            operations have completed
                                                            before performing the
                                                            store that is being
                                                            released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following store.
                                                          - Ensures that all memory
                                                            operations to memory have
                                                            completed before
                                                            performing the store that
                                                            is being released.

                                                         2. buffer/global/flat_store
     store atomic release      - system       - global   1. buffer_wbl2
                                              - generic
                                                          - Must happen before
                                                            following s_waitcnt.
                                                          - Performs L2 writeback to
                                                            ensure previous
                                                            global/generic
                                                            store/atomicrmw are
                                                            visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following store.
                                                          - Ensures that all memory
                                                            operations to memory and
                                                            the L2 writeback have
                                                            completed before
                                                            performing the store that
                                                            is being released.

                                                         3. buffer/global/flat_store
                                                            scc=1
     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
                                                          - Ensures that all memory
                                                            operations have completed
                                                            before performing the
                                                            atomicrmw that is being
                                                            released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
                                                          - Ensures that all memory
                                                            operations to global and
                                                            local have completed
                                                            before performing the
                                                            atomicrmw that is being
                                                            released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - system       - global   1. buffer_wbl2
                                              - generic
                                                          - Must happen before
                                                            following s_waitcnt.
                                                          - Performs L2 writeback to
                                                            ensure previous
                                                            global/generic
                                                            store/atomicrmw are
                                                            visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
                                                          - Ensures that all memory
                                                            operations to memory and
                                                            the L2 writeback have
                                                            completed before
                                                            performing the atomicrmw
                                                            that is being released.

                                                         3. buffer/global/flat_atomic
                                                            scc=1
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence need to
                                                            conservatively always
                                                            generate. If fence had an
                                                            address space then set to
                                                            address space of OpenCL
                                                            fence flag, or to generic
                                                            if both local and global
                                                            flags are specified.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Must happen before any
                                                            following store
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - Ensures that all memory
                                                            operations have completed
                                                            before performing the
                                                            following
                                                            fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence need to
                                                            conservatively always
                                                            generate. If fence had an
                                                            address space then set to
                                                            address space of OpenCL
                                                            fence flag, or to generic
                                                            if both local and global
                                                            flags are specified.
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before any
                                                            following store
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - Ensures that all memory
                                                            operations have completed
                                                            before performing the
                                                            following
                                                            fence-paired-atomic.

     fence        release      - system       *none*     1. buffer_wbl2

                                                          - If OpenCL and address
                                                            space is local, omit.
                                                          - Must happen before
                                                            following s_waitcnt.
                                                          - Performs L2 writeback to
                                                            ensure previous
                                                            global/generic
                                                            store/atomicrmw are
                                                            visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence need to
                                                            conservatively always
                                                            generate. If fence had an
                                                            address space then set to
                                                            address space of OpenCL
                                                            fence flag, or to generic
                                                            if both local and global
                                                            flags are specified.
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before any
                                                            following store
                                                            atomic/atomicrmw with an
                                                            equal or wider sync scope
                                                            and memory ordering
                                                            stronger than unordered
                                                            (this is termed the
                                                            fence-paired-atomic).
                                                          - Ensures that all memory
                                                            operations have completed
                                                            before performing the
                                                            following
                                                            fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen after any
                                                            preceding local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
                                                          - Ensures that all memory
                                                            operations have completed
                                                            before performing the
                                                            atomicrmw that is being
                                                            released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before the
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than the atomicrmw
                                                            value being acquired.

                                                         4. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before any
                                                            following global/generic
                                                            load/load atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than the local load
                                                            atomic value being
                                                            acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
                                                          - Ensures that all memory
                                                            operations have completed
                                                            before performing the
                                                            atomicrmw that is being
                                                            released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If not TgSplit execution
                                                            mode, omit vmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before the
                                                            following
                                                            buffer_wbinvl1_vol and any
                                                            following global/generic
                                                            load/load atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than a local load
                                                            atomic value being
                                                            acquired.

                                                         4. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
                                                          - Ensures that all memory
                                                            operations to global have
                                                            completed before
                                                            performing the atomicrmw
                                                            that is being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw has
                                                            completed before
                                                            invalidating the cache.

                                                         4. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2

                                                          - Must happen before
                                                            following s_waitcnt.
                                                          - Performs L2 writeback to
                                                            ensure previous
                                                            global/generic
                                                            store/atomicrmw are
                                                            visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
                                                          - Ensures that all memory
                                                            operations to global and
                                                            L2 writeback have
                                                            completed before
                                                            performing the atomicrmw
                                                            that is being released.

                                                         3. buffer/global_atomic
                                                            scc=1
                                                         4. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following buffer_invl2 and
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw has
                                                            completed before
                                                            invalidating the caches.

                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            MTYPE NC global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale due to the
                                                            memory probes.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently moved
                                                            according to the following
                                                            rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0) must
                                                            happen after any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before the
                                                            following atomicrmw.
7728 - Ensures that all 7729 memory operations 7730 to global have 7731 completed before 7732 performing the 7733 atomicrmw that is 7734 being released. 7735 7736 2. flat_atomic 7737 3. s_waitcnt vmcnt(0) & 7738 lgkmcnt(0) 7739 7740 - If TgSplit execution mode, 7741 omit lgkmcnt(0). 7742 - If OpenCL, omit 7743 lgkmcnt(0). 7744 - Must happen before 7745 following 7746 buffer_wbinvl1_vol. 7747 - Ensures the 7748 atomicrmw has 7749 completed before 7750 invalidating the 7751 cache. 7752 7753 4. buffer_wbinvl1_vol 7754 7755 - Must happen before 7756 any following 7757 global/generic 7758 load/load 7759 atomic/atomicrmw. 7760 - Ensures that 7761 following loads 7762 will not see stale 7763 global data. 7764 7765 atomicrmw acq_rel - system - generic 1. buffer_wbl2 7766 7767 - Must happen before 7768 following s_waitcnt. 7769 - Performs L2 writeback to 7770 ensure previous 7771 global/generic 7772 store/atomicrmw are 7773 visible at system scope. 7774 7775 2. s_waitcnt lgkmcnt(0) & 7776 vmcnt(0) 7777 7778 - If TgSplit execution mode, 7779 omit lgkmcnt(0). 7780 - If OpenCL, omit 7781 lgkmcnt(0). 7782 - Could be split into 7783 separate s_waitcnt 7784 vmcnt(0) and 7785 s_waitcnt 7786 lgkmcnt(0) to allow 7787 them to be 7788 independently moved 7789 according to the 7790 following rules. 7791 - s_waitcnt vmcnt(0) 7792 must happen after 7793 any preceding 7794 global/generic 7795 load/store/load 7796 atomic/store 7797 atomic/atomicrmw. 7798 - s_waitcnt lgkmcnt(0) 7799 must happen after 7800 any preceding 7801 local/generic 7802 load/store/load 7803 atomic/store 7804 atomic/atomicrmw. 7805 - Must happen before 7806 the following 7807 atomicrmw. 7808 - Ensures that all 7809 memory operations 7810 to global and L2 writeback 7811 have completed before 7812 performing the 7813 atomicrmw that is 7814 being released. 7815 7816 3. flat_atomic scc=1 7817 4. s_waitcnt vmcnt(0) & 7818 lgkmcnt(0) 7819 7820 - If TgSplit execution mode, 7821 omit lgkmcnt(0). 
7822 - If OpenCL, omit 7823 lgkmcnt(0). 7824 - Must happen before 7825 following buffer_invl2 and 7826 buffer_wbinvl1_vol. 7827 - Ensures the 7828 atomicrmw has 7829 completed before 7830 invalidating the 7831 caches. 7832 7833 5. buffer_invl2; 7834 buffer_wbinvl1_vol 7835 7836 - Must happen before 7837 any following 7838 global/generic 7839 load/load 7840 atomic/atomicrmw. 7841 - Ensures that 7842 following loads 7843 will not see stale 7844 MTYPE NC global data. 7845 MTYPE RW and CC memory will 7846 never be stale due to the 7847 memory probes. 7848 7849 fence acq_rel - singlethread *none* *none* 7850 - wavefront 7851 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 7852 7853 - Use lgkmcnt(0) if not 7854 TgSplit execution mode 7855 and vmcnt(0) if TgSplit 7856 execution mode. 7857 - If OpenCL and 7858 address space is 7859 not generic, omit 7860 lgkmcnt(0). 7861 - If OpenCL and 7862 address space is 7863 local, omit 7864 vmcnt(0). 7865 - However, 7866 since LLVM 7867 currently has no 7868 address space on 7869 the fence need to 7870 conservatively 7871 always generate 7872 (see comment for 7873 previous fence). 7874 - s_waitcnt vmcnt(0) 7875 must happen after 7876 any preceding 7877 global/generic 7878 load/store/ 7879 load atomic/store atomic/ 7880 atomicrmw. 7881 - s_waitcnt lgkmcnt(0) 7882 must happen after 7883 any preceding 7884 local/generic 7885 load/load 7886 atomic/store/store 7887 atomic/atomicrmw. 7888 - Must happen before 7889 any following 7890 global/generic 7891 load/load 7892 atomic/store/store 7893 atomic/atomicrmw. 7894 - Ensures that all 7895 memory operations 7896 have 7897 completed before 7898 performing any 7899 following global 7900 memory operations. 
7901 - Ensures that the 7902 preceding 7903 local/generic load 7904 atomic/atomicrmw 7905 with an equal or 7906 wider sync scope 7907 and memory ordering 7908 stronger than 7909 unordered (this is 7910 termed the 7911 acquire-fence-paired-atomic) 7912 has completed 7913 before following 7914 global memory 7915 operations. This 7916 satisfies the 7917 requirements of 7918 acquire. 7919 - Ensures that all 7920 previous memory 7921 operations have 7922 completed before a 7923 following 7924 local/generic store 7925 atomic/atomicrmw 7926 with an equal or 7927 wider sync scope 7928 and memory ordering 7929 stronger than 7930 unordered (this is 7931 termed the 7932 release-fence-paired-atomic). 7933 This satisfies the 7934 requirements of 7935 release. 7936 - Must happen before 7937 the following 7938 buffer_wbinvl1_vol. 7939 - Ensures that the 7940 acquire-fence-paired 7941 atomic has completed 7942 before invalidating 7943 the 7944 cache. Therefore 7945 any following 7946 locations read must 7947 be no older than 7948 the value read by 7949 the 7950 acquire-fence-paired-atomic. 7951 7952 3. buffer_wbinvl1_vol 7953 7954 - If not TgSplit execution 7955 mode, omit. 7956 - Ensures that 7957 following 7958 loads will not see 7959 stale data. 7960 7961 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 7962 vmcnt(0) 7963 7964 - If TgSplit execution mode, 7965 omit lgkmcnt(0). 7966 - If OpenCL and 7967 address space is 7968 not generic, omit 7969 lgkmcnt(0). 7970 - However, since LLVM 7971 currently has no 7972 address space on 7973 the fence need to 7974 conservatively 7975 always generate 7976 (see comment for 7977 previous fence). 7978 - Could be split into 7979 separate s_waitcnt 7980 vmcnt(0) and 7981 s_waitcnt 7982 lgkmcnt(0) to allow 7983 them to be 7984 independently moved 7985 according to the 7986 following rules. 7987 - s_waitcnt vmcnt(0) 7988 must happen after 7989 any preceding 7990 global/generic 7991 load/store/load 7992 atomic/store 7993 atomic/atomicrmw. 
7994 - s_waitcnt lgkmcnt(0) 7995 must happen after 7996 any preceding 7997 local/generic 7998 load/store/load 7999 atomic/store 8000 atomic/atomicrmw. 8001 - Must happen before 8002 the following 8003 buffer_wbinvl1_vol. 8004 - Ensures that the 8005 preceding 8006 global/local/generic 8007 load 8008 atomic/atomicrmw 8009 with an equal or 8010 wider sync scope 8011 and memory ordering 8012 stronger than 8013 unordered (this is 8014 termed the 8015 acquire-fence-paired-atomic) 8016 has completed 8017 before invalidating 8018 the cache. This 8019 satisfies the 8020 requirements of 8021 acquire. 8022 - Ensures that all 8023 previous memory 8024 operations have 8025 completed before a 8026 following 8027 global/local/generic 8028 store 8029 atomic/atomicrmw 8030 with an equal or 8031 wider sync scope 8032 and memory ordering 8033 stronger than 8034 unordered (this is 8035 termed the 8036 release-fence-paired-atomic). 8037 This satisfies the 8038 requirements of 8039 release. 8040 8041 2. buffer_wbinvl1_vol 8042 8043 - Must happen before 8044 any following 8045 global/generic 8046 load/load 8047 atomic/store/store 8048 atomic/atomicrmw. 8049 - Ensures that 8050 following loads 8051 will not see stale 8052 global data. This 8053 satisfies the 8054 requirements of 8055 acquire. 8056 8057 fence acq_rel - system *none* 1. buffer_wbl2 8058 8059 - If OpenCL and 8060 address space is 8061 local, omit. 8062 - Must happen before 8063 following s_waitcnt. 8064 - Performs L2 writeback to 8065 ensure previous 8066 global/generic 8067 store/atomicrmw are 8068 visible at system scope. 8069 8070 2. s_waitcnt lgkmcnt(0) & 8071 vmcnt(0) 8072 8073 - If TgSplit execution mode, 8074 omit lgkmcnt(0). 8075 - If OpenCL and 8076 address space is 8077 not generic, omit 8078 lgkmcnt(0). 8079 - However, since LLVM 8080 currently has no 8081 address space on 8082 the fence need to 8083 conservatively 8084 always generate 8085 (see comment for 8086 previous fence). 
8087 - Could be split into 8088 separate s_waitcnt 8089 vmcnt(0) and 8090 s_waitcnt 8091 lgkmcnt(0) to allow 8092 them to be 8093 independently moved 8094 according to the 8095 following rules. 8096 - s_waitcnt vmcnt(0) 8097 must happen after 8098 any preceding 8099 global/generic 8100 load/store/load 8101 atomic/store 8102 atomic/atomicrmw. 8103 - s_waitcnt lgkmcnt(0) 8104 must happen after 8105 any preceding 8106 local/generic 8107 load/store/load 8108 atomic/store 8109 atomic/atomicrmw. 8110 - Must happen before 8111 the following buffer_invl2 and 8112 buffer_wbinvl1_vol. 8113 - Ensures that the 8114 preceding 8115 global/local/generic 8116 load 8117 atomic/atomicrmw 8118 with an equal or 8119 wider sync scope 8120 and memory ordering 8121 stronger than 8122 unordered (this is 8123 termed the 8124 acquire-fence-paired-atomic) 8125 has completed 8126 before invalidating 8127 the cache. This 8128 satisfies the 8129 requirements of 8130 acquire. 8131 - Ensures that all 8132 previous memory 8133 operations have 8134 completed before a 8135 following 8136 global/local/generic 8137 store 8138 atomic/atomicrmw 8139 with an equal or 8140 wider sync scope 8141 and memory ordering 8142 stronger than 8143 unordered (this is 8144 termed the 8145 release-fence-paired-atomic). 8146 This satisfies the 8147 requirements of 8148 release. 8149 8150 3. buffer_invl2; 8151 buffer_wbinvl1_vol 8152 8153 - Must happen before 8154 any following 8155 global/generic 8156 load/load 8157 atomic/store/store 8158 atomic/atomicrmw. 8159 - Ensures that 8160 following loads 8161 will not see stale 8162 MTYPE NC global data. 8163 MTYPE RW and CC memory will 8164 never be stale due to the 8165 memory probes. 
8166 8167 **Sequential Consistent Atomic** 8168 ------------------------------------------------------------------------------------ 8169 load atomic seq_cst - singlethread - global *Same as corresponding 8170 - wavefront - local load atomic acquire, 8171 - generic except must generated 8172 all instructions even 8173 for OpenCL.* 8174 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 8175 - generic 8176 - Use lgkmcnt(0) if not 8177 TgSplit execution mode 8178 and vmcnt(0) if TgSplit 8179 execution mode. 8180 - s_waitcnt lgkmcnt(0) must 8181 happen after 8182 preceding 8183 local/generic load 8184 atomic/store 8185 atomic/atomicrmw 8186 with memory 8187 ordering of seq_cst 8188 and with equal or 8189 wider sync scope. 8190 (Note that seq_cst 8191 fences have their 8192 own s_waitcnt 8193 lgkmcnt(0) and so do 8194 not need to be 8195 considered.) 8196 - s_waitcnt vmcnt(0) 8197 must happen after 8198 preceding 8199 global/generic load 8200 atomic/store 8201 atomic/atomicrmw 8202 with memory 8203 ordering of seq_cst 8204 and with equal or 8205 wider sync scope. 8206 (Note that seq_cst 8207 fences have their 8208 own s_waitcnt 8209 vmcnt(0) and so do 8210 not need to be 8211 considered.) 8212 - Ensures any 8213 preceding 8214 sequential 8215 consistent global/local 8216 memory instructions 8217 have completed 8218 before executing 8219 this sequentially 8220 consistent 8221 instruction. This 8222 prevents reordering 8223 a seq_cst store 8224 followed by a 8225 seq_cst load. (Note 8226 that seq_cst is 8227 stronger than 8228 acquire/release as 8229 the reordering of 8230 load acquire 8231 followed by a store 8232 release is 8233 prevented by the 8234 s_waitcnt of 8235 the release, but 8236 there is nothing 8237 preventing a store 8238 release followed by 8239 load acquire from 8240 completing out of 8241 order. The s_waitcnt 8242 could be placed after 8243 seq_store or before 8244 the seq_load. 
We 8245 choose the load to 8246 make the s_waitcnt be 8247 as late as possible 8248 so that the store 8249 may have already 8250 completed.) 8251 8252 2. *Following 8253 instructions same as 8254 corresponding load 8255 atomic acquire, 8256 except must generated 8257 all instructions even 8258 for OpenCL.* 8259 load atomic seq_cst - workgroup - local *If TgSplit execution mode, 8260 local address space cannot 8261 be used.* 8262 8263 *Same as corresponding 8264 load atomic acquire, 8265 except must generated 8266 all instructions even 8267 for OpenCL.* 8268 8269 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 8270 - system - generic vmcnt(0) 8271 8272 - If TgSplit execution mode, 8273 omit lgkmcnt(0). 8274 - Could be split into 8275 separate s_waitcnt 8276 vmcnt(0) 8277 and s_waitcnt 8278 lgkmcnt(0) to allow 8279 them to be 8280 independently moved 8281 according to the 8282 following rules. 8283 - s_waitcnt lgkmcnt(0) 8284 must happen after 8285 preceding 8286 global/generic load 8287 atomic/store 8288 atomic/atomicrmw 8289 with memory 8290 ordering of seq_cst 8291 and with equal or 8292 wider sync scope. 8293 (Note that seq_cst 8294 fences have their 8295 own s_waitcnt 8296 lgkmcnt(0) and so do 8297 not need to be 8298 considered.) 8299 - s_waitcnt vmcnt(0) 8300 must happen after 8301 preceding 8302 global/generic load 8303 atomic/store 8304 atomic/atomicrmw 8305 with memory 8306 ordering of seq_cst 8307 and with equal or 8308 wider sync scope. 8309 (Note that seq_cst 8310 fences have their 8311 own s_waitcnt 8312 vmcnt(0) and so do 8313 not need to be 8314 considered.) 8315 - Ensures any 8316 preceding 8317 sequential 8318 consistent global 8319 memory instructions 8320 have completed 8321 before executing 8322 this sequentially 8323 consistent 8324 instruction. This 8325 prevents reordering 8326 a seq_cst store 8327 followed by a 8328 seq_cst load. 
(Note 8329 that seq_cst is 8330 stronger than 8331 acquire/release as 8332 the reordering of 8333 load acquire 8334 followed by a store 8335 release is 8336 prevented by the 8337 s_waitcnt of 8338 the release, but 8339 there is nothing 8340 preventing a store 8341 release followed by 8342 load acquire from 8343 completing out of 8344 order. The s_waitcnt 8345 could be placed after 8346 seq_store or before 8347 the seq_load. We 8348 choose the load to 8349 make the s_waitcnt be 8350 as late as possible 8351 so that the store 8352 may have already 8353 completed.) 8354 8355 2. *Following 8356 instructions same as 8357 corresponding load 8358 atomic acquire, 8359 except must generated 8360 all instructions even 8361 for OpenCL.* 8362 store atomic seq_cst - singlethread - global *Same as corresponding 8363 - wavefront - local store atomic release, 8364 - workgroup - generic except must generated 8365 - agent all instructions even 8366 - system for OpenCL.* 8367 atomicrmw seq_cst - singlethread - global *Same as corresponding 8368 - wavefront - local atomicrmw acq_rel, 8369 - workgroup - generic except must generated 8370 - agent all instructions even 8371 - system for OpenCL.* 8372 fence seq_cst - singlethread *none* *Same as corresponding 8373 - wavefront fence acq_rel, 8374 - workgroup except must generated 8375 - agent all instructions even 8376 - system for OpenCL.* 8377 ============ ============ ============== ========== ================================ 8378 8379.. _amdgpu-amdhsa-memory-model-gfx10: 8380 8381Memory Model GFX10 8382++++++++++++++++++ 8383 8384For GFX10: 8385 8386* Each agent has multiple shader arrays (SA). 8387* Each SA has multiple work-group processors (WGP). 8388* Each WGP has multiple compute units (CU). 8389* Each CU has multiple SIMDs that execute wavefronts. 8390* The wavefronts for a single work-group are executed in the same 8391 WGP. In CU wavefront execution mode the wavefronts may be executed by 8392 different SIMDs in the same CU. 
  In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP.
  The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other.
  A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and vscnt reports the completion of
store and atomic without return in order. See the ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the
  per-CU L0 caches is required for work-group synchronization. Also accesses to
  L1 at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0, which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See the ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.
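The choice of WGP or CU mode only changes which cache-maintenance instructions
must be emitted; the ordering contract the code sequences implement is the
ordinary release/acquire pairing. A minimal host-side sketch in C++11 atomics
(illustrative only — this is not AMDGPU machine code) of the guarantee being
provided:

.. code-block:: cpp

  // Sketch: the release/acquire contract that the GFX10 code sequences
  // implement in hardware. On the GPU, the release side corresponds to
  // waiting on the relevant counters (s_waitcnt) before the releasing
  // store, and the acquire side to the acquiring load followed by
  // s_waitcnt and cache invalidation (buffer_gl0_inv/buffer_gl1_inv)
  // so that later loads do not observe stale data from another cache.
  #include <atomic>
  #include <cassert>
  #include <thread>

  int payload = 0;
  std::atomic<int> flag{0};

  void producer() {
      payload = 42;                              // plain (non-atomic) store
      flag.store(1, std::memory_order_release);  // release: prior writes
                                                 // must be visible first
  }

  void consumer() {
      while (flag.load(std::memory_order_acquire) == 0) {
          // acquire: once the flag is seen, the producer's earlier
          // writes are guaranteed visible to this thread
      }
      assert(payload == 42);
  }

  int main() {
      std::thread t1(producer), t2(consumer);
      t1.join();
      t2.join();
      return 0;
  }

Mapping this onto the table that follows: at agent scope an acquiring global
load is ``buffer/global_load glc=1 dlc=1`` followed by ``s_waitcnt vmcnt(0)``
and ``buffer_gl0_inv``/``buffer_gl1_inv``, which together play the role of the
``memory_order_acquire`` side above.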
8508 8509The code sequences used to implement the memory model for GFX10 are defined in 8510table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`. 8511 8512 .. table:: AMDHSA Memory Model Code Sequences GFX10 8513 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table 8514 8515 ============ ============ ============== ========== ================================ 8516 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 8517 Ordering Sync Scope Address GFX10 8518 Space 8519 ============ ============ ============== ========== ================================ 8520 **Non-Atomic** 8521 ------------------------------------------------------------------------------------ 8522 load *none* *none* - global - !volatile & !nontemporal 8523 - generic 8524 - private 1. buffer/global/flat_load 8525 - constant 8526 - !volatile & nontemporal 8527 8528 1. buffer/global/flat_load 8529 slc=1 8530 8531 - volatile 8532 8533 1. buffer/global/flat_load 8534 glc=1 dlc=1 8535 2. s_waitcnt vmcnt(0) 8536 8537 - Must happen before 8538 any following volatile 8539 global/generic 8540 load/store. 8541 - Ensures that 8542 volatile 8543 operations to 8544 different 8545 addresses will not 8546 be reordered by 8547 hardware. 8548 8549 load *none* *none* - local 1. ds_load 8550 store *none* *none* - global - !volatile & !nontemporal 8551 - generic 8552 - private 1. buffer/global/flat_store 8553 - constant 8554 - !volatile & nontemporal 8555 8556 1. buffer/global/flat_store 8557 slc=1 8558 8559 - volatile 8560 8561 1. buffer/global/flat_store 8562 2. s_waitcnt vscnt(0) 8563 8564 - Must happen before 8565 any following volatile 8566 global/generic 8567 load/store. 8568 - Ensures that 8569 volatile 8570 operations to 8571 different 8572 addresses will not 8573 be reordered by 8574 hardware. 8575 8576 store *none* *none* - local 1. 
ds_store 8577 **Unordered Atomic** 8578 ------------------------------------------------------------------------------------ 8579 load atomic unordered *any* *any* *Same as non-atomic*. 8580 store atomic unordered *any* *any* *Same as non-atomic*. 8581 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 8582 **Monotonic Atomic** 8583 ------------------------------------------------------------------------------------ 8584 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 8585 - wavefront - generic 8586 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 8587 - generic glc=1 8588 8589 - If CU wavefront execution 8590 mode, omit glc=1. 8591 8592 load atomic monotonic - singlethread - local 1. ds_load 8593 - wavefront 8594 - workgroup 8595 load atomic monotonic - agent - global 1. buffer/global/flat_load 8596 - system - generic glc=1 dlc=1 8597 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 8598 - wavefront - generic 8599 - workgroup 8600 - agent 8601 - system 8602 store atomic monotonic - singlethread - local 1. ds_store 8603 - wavefront 8604 - workgroup 8605 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 8606 - wavefront - generic 8607 - workgroup 8608 - agent 8609 - system 8610 atomicrmw monotonic - singlethread - local 1. ds_atomic 8611 - wavefront 8612 - workgroup 8613 **Acquire Atomic** 8614 ------------------------------------------------------------------------------------ 8615 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 8616 - wavefront - local 8617 - generic 8618 load atomic acquire - workgroup - global 1. buffer/global_load glc=1 8619 8620 - If CU wavefront execution 8621 mode, omit glc=1. 8622 8623 2. s_waitcnt vmcnt(0) 8624 8625 - If CU wavefront execution 8626 mode, omit. 
8627 - Must happen before 8628 the following buffer_gl0_inv 8629 and before any following 8630 global/generic 8631 load/load 8632 atomic/store/store 8633 atomic/atomicrmw. 8634 8635 3. buffer_gl0_inv 8636 8637 - If CU wavefront execution 8638 mode, omit. 8639 - Ensures that 8640 following 8641 loads will not see 8642 stale data. 8643 8644 load atomic acquire - workgroup - local 1. ds_load 8645 2. s_waitcnt lgkmcnt(0) 8646 8647 - If OpenCL, omit. 8648 - Must happen before 8649 the following buffer_gl0_inv 8650 and before any following 8651 global/generic load/load 8652 atomic/store/store 8653 atomic/atomicrmw. 8654 - Ensures any 8655 following global 8656 data read is no 8657 older than the local load 8658 atomic value being 8659 acquired. 8660 8661 3. buffer_gl0_inv 8662 8663 - If CU wavefront execution 8664 mode, omit. 8665 - If OpenCL, omit. 8666 - Ensures that 8667 following 8668 loads will not see 8669 stale data. 8670 8671 load atomic acquire - workgroup - generic 1. flat_load glc=1 8672 8673 - If CU wavefront execution 8674 mode, omit glc=1. 8675 8676 2. s_waitcnt lgkmcnt(0) & 8677 vmcnt(0) 8678 8679 - If CU wavefront execution 8680 mode, omit vmcnt(0). 8681 - If OpenCL, omit 8682 lgkmcnt(0). 8683 - Must happen before 8684 the following 8685 buffer_gl0_inv and any 8686 following global/generic 8687 load/load 8688 atomic/store/store 8689 atomic/atomicrmw. 8690 - Ensures any 8691 following global 8692 data read is no 8693 older than a local load 8694 atomic value being 8695 acquired. 8696 8697 3. buffer_gl0_inv 8698 8699 - If CU wavefront execution 8700 mode, omit. 8701 - Ensures that 8702 following 8703 loads will not see 8704 stale data. 8705 8706 load atomic acquire - agent - global 1. buffer/global_load 8707 - system glc=1 dlc=1 8708 2. s_waitcnt vmcnt(0) 8709 8710 - Must happen before 8711 following 8712 buffer_gl*_inv. 8713 - Ensures the load 8714 has completed 8715 before invalidating 8716 the caches. 8717 8718 3. 
                                                            buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
                               - system                  2. s_waitcnt vmcnt(0) & lgkmcnt(0)

                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before following buffer_gl*_inv.
                                                            - Ensures the flat_load has completed before
                                                              invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vm/vscnt(0)

                                                            - If CU wavefront execution mode, omit.
                                                            - Use vmcnt(0) if atomic with return and vscnt(0)
                                                              if atomic with no-return.
                                                            - Must happen before the following buffer_gl0_inv
                                                              and before any following global/generic
                                                              load/load atomic/store/store atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution mode, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before the following
                                                              buffer_gl0_inv.
                                                            - Ensures any following global data read is no
                                                              older than the local atomicrmw value being
                                                              acquired.

                                                         3. buffer_gl0_inv

                                                            - If OpenCL, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0) & vm/vscnt(0)

                                                            - If CU wavefront execution mode, omit
                                                              vm/vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with return and vscnt(0)
                                                              if atomic with no-return.
                                                            - Must happen before the following
                                                              buffer_gl0_inv.
                                                            - Ensures any following global data read is no
                                                              older than a local atomicrmw value being
                                                              acquired.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution mode, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vm/vscnt(0)

                                                            - Use vmcnt(0) if atomic with return and vscnt(0)
                                                              if atomic with no-return.
                                                            - Must happen before following buffer_gl*_inv.
                                                            - Ensures the atomicrmw has completed before
                                                              invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vm/vscnt(0) & lgkmcnt(0)

                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with return and vscnt(0)
                                                              if atomic with no-return.
                                                            - Must happen before following buffer_gl*_inv.
                                                            - Ensures the atomicrmw has completed before
                                                              invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL and address space is not generic,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM currently has no address
                                                              space on the fence, it needs to be
                                                              conservatively always generated. If fence had
                                                              an address space then set to address space of
                                                              OpenCL fence flag, or to generic if both local
                                                              and global flags are specified.
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load
                                                              atomic/atomicrmw-with-return-value with an
                                                              equal or wider sync scope and memory ordering
                                                              stronger than unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic
                                                              atomicrmw-no-return-value with an equal or
                                                              wider sync scope and memory ordering stronger
                                                              than unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load atomic/atomicrmw
                                                              with an equal or wider sync scope and memory
                                                              ordering stronger than unordered (this is
                                                              termed the fence-paired-atomic).
                                                            - Must happen before the following
                                                              buffer_gl0_inv.
                                                            - Ensures that the fence-paired atomic has
                                                              completed before invalidating the cache.
                                                              Therefore any following locations read must be
                                                              no older than the value read by the
                                                              fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                            - If CU wavefront execution mode, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system
                                                            - If OpenCL and address space is not generic,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM currently has no address
                                                              space on the fence, it needs to be
                                                              conservatively always generated (see comment
                                                              for previous fence).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load
                                                              atomic/atomicrmw-with-return-value with an
                                                              equal or wider sync scope and memory ordering
                                                              stronger than unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic
                                                              atomicrmw-no-return-value with an equal or
                                                              wider sync scope and memory ordering stronger
                                                              than unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load atomic/atomicrmw
                                                              with an equal or wider sync scope and memory
                                                              ordering stronger than unordered (this is
                                                              termed the fence-paired-atomic).
                                                            - Must happen before the following
                                                              buffer_gl*_inv.
                                                            - Ensures that the fence-paired atomic has
                                                              completed before invalidating the caches.
                                                              Therefore any following locations read must be
                                                              no older than the value read by the
                                                              fence-paired-atomic.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
                               - wavefront    - local
                                              - generic
     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                                              - generic
                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following store.
                                                            - Ensures that all memory operations have
                                                              completed before performing the store that is
                                                              being released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt vscnt(0) to allow them to be
                                                              independently moved according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - Must happen before the following store.
                                                            - Ensures that all global memory operations have
                                                              completed before performing the store that is
                                                              being released.

                                                         2. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system       - generic
                                                            - If OpenCL and address space is not generic,
                                                              omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following store.
                                                            - Ensures that all memory operations have
                                                              completed before performing the store that is
                                                              being released.

                                                         2. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                                              - generic
                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations have
                                                              completed before performing the atomicrmw that
                                                              is being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt vscnt(0) to allow them to be
                                                              independently moved according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - Must happen before the following store.
                                                            - Ensures that all global memory operations have
                                                              completed before performing the store that is
                                                              being released.

                                                         2. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system       - generic
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations to global
                                                              and local have completed before performing the
                                                              atomicrmw that is being released.

                                                         2. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL and address space is not generic,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM currently has no address
                                                              space on the fence, it needs to be
                                                              conservatively always generated. If fence had
                                                              an address space then set to address space of
                                                              OpenCL fence flag, or to generic if both local
                                                              and global flags are specified.
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before any following store
                                                              atomic/atomicrmw with an equal or wider sync
                                                              scope and memory ordering stronger than
                                                              unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all memory operations have
                                                              completed before performing the following
                                                              fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system
                                                            - If OpenCL and address space is not generic,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM currently has no address
                                                              space on the fence, it needs to be
                                                              conservatively always generated. If fence had
                                                              an address space then set to address space of
                                                              OpenCL fence flag, or to generic if both local
                                                              and global flags are specified.
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before any following store
                                                              atomic/atomicrmw with an equal or wider sync
                                                              scope and memory ordering stronger than
                                                              unordered (this is termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all memory operations have
                                                              completed before performing the following
                                                              fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen after any preceding local/generic
                                                              load/store/load atomic/store atomic/atomicrmw.
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0), and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations have
                                                              completed before performing the atomicrmw that
                                                              is being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                            - If CU wavefront execution mode, omit.
                                                            - Use vmcnt(0) if atomic with return and vscnt(0)
                                                              if atomic with no-return.
                                                            - Must happen before the following
                                                              buffer_gl0_inv.
                                                            - Ensures any following global data read is no
                                                              older than the atomicrmw value being acquired.

                                                         4. buffer_gl0_inv

                                                            - If CU wavefront execution mode, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt vscnt(0) to allow them to be
                                                              independently moved according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - Must happen before the following store.
                                                            - Ensures that all global memory operations have
                                                              completed before performing the store that is
                                                              being released.

                                                         2. ds_atomic
                                                         3. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before the following
                                                              buffer_gl0_inv.
                                                            - Ensures any following global data read is no
                                                              older than the local load atomic value being
                                                              acquired.

                                                         4. buffer_gl0_inv

                                                            - If CU wavefront execution mode, omit.
                                                            - If OpenCL, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations have
                                                              completed before performing the atomicrmw that
                                                              is being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before the following
                                                              buffer_gl0_inv.
                                                            - Ensures any following global data read is no
                                                              older than the load atomic value being
                                                              acquired.

                                                         4. buffer_gl0_inv

                                                            - If CU wavefront execution mode, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations to global
                                                              have completed before performing the atomicrmw
                                                              that is being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                            - Use vmcnt(0) if atomic with return and vscnt(0)
                                                              if atomic with no-return.
                                                            - Must happen before following buffer_gl*_inv.
                                                            - Ensures the atomicrmw has completed before
                                                              invalidating the caches.

                                                         4. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0), and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following atomicrmw.
                                                            - Ensures that all memory operations have
                                                              completed before performing the atomicrmw that
                                                              is being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt vm/vscnt(0) & lgkmcnt(0)

                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with return and vscnt(0)
                                                              if atomic with no-return.
                                                            - Must happen before following buffer_gl*_inv.
                                                            - Ensures the atomicrmw has completed before
                                                              invalidating the caches.

                                                         4. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - If OpenCL and address space is not generic,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM currently has no address
                                                              space on the fence, it needs to be
                                                              conservatively always generated (see comment
                                                              for previous fence).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before any following
                                                              global/generic load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that all memory operations have
                                                              completed before performing any following
                                                              global memory operations.
                                                            - Ensures that the preceding local/generic load
                                                              atomic/atomicrmw with an equal or wider sync
                                                              scope and memory ordering stronger than
                                                              unordered (this is termed the
                                                              acquire-fence-paired-atomic) has completed
                                                              before following global memory operations.
                                                              This satisfies the requirements of acquire.
                                                            - Ensures that all previous memory operations
                                                              have completed before a following
                                                              local/generic store atomic/atomicrmw with an
                                                              equal or wider sync scope and memory ordering
                                                              stronger than unordered (this is termed the
                                                              release-fence-paired-atomic). This satisfies
                                                              the requirements of release.
                                                            - Must happen before the following
                                                              buffer_gl0_inv.
                                                            - Ensures that the acquire-fence-paired atomic
                                                              has completed before invalidating the cache.
                                                              Therefore any following locations read must be
                                                              no older than the value read by the
                                                              acquire-fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                            - If CU wavefront execution mode, omit.
                                                            - Ensures that following loads will not see stale
                                                              data.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system
                                                            - If OpenCL and address space is not generic,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL and address space is local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM currently has no address
                                                              space on the fence, it needs to be
                                                              conservatively always generated (see comment
                                                              for previous fence).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt vmcnt(0) must happen after any
                                                              preceding global/generic load/load
                                                              atomic/atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0) must happen after any
                                                              preceding global/generic store/store
                                                              atomic/atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0) must happen after any
                                                              preceding local/generic load/store/load
                                                              atomic/store atomic/atomicrmw.
                                                            - Must happen before the following
                                                              buffer_gl*_inv.
                                                            - Ensures that the preceding
                                                              global/local/generic load atomic/atomicrmw
                                                              with an equal or wider sync scope and memory
                                                              ordering stronger than unordered (this is
                                                              termed the acquire-fence-paired-atomic) has
                                                              completed before invalidating the caches. This
                                                              satisfies the requirements of acquire.
                                                            - Ensures that all previous memory operations
                                                              have completed before a following
                                                              global/local/generic store atomic/atomicrmw
                                                              with an equal or wider sync scope and memory
                                                              ordering stronger than unordered (this is
                                                              termed the release-fence-paired-atomic). This
                                                              satisfies the requirements of release.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any following global/generic
                                                              load/load atomic/store/store atomic/atomicrmw.
                                                            - Ensures that following loads will not see stale
                                                              global data. This satisfies the requirements
                                                              of acquire.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding load atomic acquire,
                               - wavefront    - local    except must generate all instructions even for
                                              - generic  OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                                              - generic
                                                            - If CU wavefront execution mode, omit vmcnt(0)
                                                              and vscnt(0).
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0), and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt lgkmcnt(0) must happen after
                                                              preceding local/generic load atomic/store
                                                              atomic/atomicrmw with memory ordering of
                                                              seq_cst and with equal or wider sync scope.
                                                              (Note that seq_cst fences have their own
                                                              s_waitcnt lgkmcnt(0) and so do not need to be
                                                              considered.)
                                                            - s_waitcnt vmcnt(0) must happen after preceding
                                                              global/generic load
                                                              atomic/atomicrmw-with-return-value with memory
                                                              ordering of seq_cst and with equal or wider
                                                              sync scope. (Note that seq_cst fences have
                                                              their own s_waitcnt vmcnt(0) and so do not
                                                              need to be considered.)
                                                            - s_waitcnt vscnt(0) must happen after preceding
                                                              global/generic store
                                                              atomic/atomicrmw-no-return-value with memory
                                                              ordering of seq_cst and with equal or wider
                                                              sync scope. (Note that seq_cst fences have
                                                              their own s_waitcnt vscnt(0) and so do not
                                                              need to be considered.)
                                                            - Ensures any preceding sequential consistent
                                                              global/local memory instructions have
                                                              completed before executing this sequentially
                                                              consistent instruction. This prevents
                                                              reordering a seq_cst store followed by a
                                                              seq_cst load. (Note that seq_cst is stronger
                                                              than acquire/release as the reordering of load
                                                              acquire followed by a store release is
                                                              prevented by the s_waitcnt of the release, but
                                                              there is nothing preventing a store release
                                                              followed by load acquire from completing out
                                                              of order. The s_waitcnt could be placed after
                                                              seq_store or before the seq_load. We choose
                                                              the load to make the s_waitcnt be as late as
                                                              possible so that the store may have already
                                                              completed.)

                                                         2. *Following instructions same as corresponding
                                                            load atomic acquire, except must generate all
                                                            instructions even for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution mode, omit.
                                                            - Could be split into separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt vscnt(0) to allow them to be
                                                              independently moved according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must happen after preceding
                                                              global/generic load
                                                              atomic/atomicrmw-with-return-value with memory
                                                              ordering of seq_cst and with equal or wider
                                                              sync scope. (Note that seq_cst fences have
                                                              their own s_waitcnt vmcnt(0) and so do not
                                                              need to be considered.)
                                                            - s_waitcnt vscnt(0) must happen after preceding
                                                              global/generic store
                                                              atomic/atomicrmw-no-return-value with memory
                                                              ordering of seq_cst and with equal or wider
                                                              sync scope. (Note that seq_cst fences have
                                                              their own s_waitcnt vscnt(0) and so do not
                                                              need to be considered.)
                                                            - Ensures any preceding sequential consistent
                                                              global memory instructions have completed
                                                              before executing this sequentially consistent
                                                              instruction. This prevents reordering a
                                                              seq_cst store followed by a seq_cst load.
                                                              (Note that seq_cst is stronger than
                                                              acquire/release as the reordering of load
                                                              acquire followed by a store release is
                                                              prevented by the s_waitcnt of the release, but
                                                              there is nothing preventing a store release
                                                              followed by load acquire from completing out
                                                              of order. The s_waitcnt could be placed after
                                                              seq_store or before the seq_load. We choose
                                                              the load to make the s_waitcnt be as late as
                                                              possible so that the store may have already
                                                              completed.)

                                                         2. *Following instructions same as corresponding
                                                            load atomic acquire, except must generate all
                                                            instructions even for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
                               - system       - generic
                                                            - Could be split into separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow them to be independently
                                                              moved according to the following rules.
                                                            - s_waitcnt lgkmcnt(0) must happen after
                                                              preceding local load atomic/store
                                                              atomic/atomicrmw with memory ordering of
                                                              seq_cst and with equal or wider sync scope.
                                                              (Note that seq_cst fences have their own
                                                              s_waitcnt lgkmcnt(0) and so do not need to be
                                                              considered.)
                                                            - s_waitcnt vmcnt(0) must happen after preceding
                                                              global/generic load
                                                              atomic/atomicrmw-with-return-value with memory
                                                              ordering of seq_cst and with equal or wider
                                                              sync scope. (Note that seq_cst fences have
                                                              their own s_waitcnt vmcnt(0) and so do not
                                                              need to be considered.)
                                                            - s_waitcnt vscnt(0) must happen after preceding
                                                              global/generic store
                                                              atomic/atomicrmw-no-return-value with memory
                                                              ordering of seq_cst and with equal or wider
                                                              sync scope. (Note that seq_cst fences have
                                                              their own s_waitcnt vscnt(0) and so do not
                                                              need to be considered.)
                                                            - Ensures any preceding sequential consistent
                                                              global memory instructions have completed
                                                              before executing this sequentially consistent
                                                              instruction. This prevents reordering a
                                                              seq_cst store followed by a seq_cst load.
                                                              (Note that seq_cst is stronger than
                                                              acquire/release as the reordering of load
                                                              acquire followed by a store release is
                                                              prevented by the s_waitcnt of the release, but
                                                              there is nothing preventing a store release
                                                              followed by load acquire from completing out
                                                              of order. The s_waitcnt could be placed after
                                                              seq_store or before the seq_load. We choose
                                                              the load to make the s_waitcnt be as late as
                                                              possible so that the store may have already
                                                              completed.)

                                                         2. *Following instructions same as corresponding
                                                            load atomic acquire, except must generate all
                                                            instructions even for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding store atomic release,
                               - wavefront    - local    except must generate all instructions even for
                               - workgroup    - generic  OpenCL.*
                               - agent
                               - system
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding atomicrmw acq_rel, except
                               - wavefront    - local    must generate all instructions even for OpenCL.*
                               - workgroup    - generic
                               - agent
                               - system
     fence        seq_cst      - singlethread *none*     *Same as corresponding fence acq_rel, except must
                               - wavefront               generate all instructions even for OpenCL.*
                               - workgroup
                               - agent
                               - system
     ============ ============ ============== ========== ================================

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
supports the ``s_trap`` instruction. For usage see:

- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`

  ..
  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                           ``queue_ptr`` intrinsic (not implemented).
                                         ``VGPR0``:
                                           ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                           ``queue_ptr`` the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
                                                           as a no-operation. The trap handler
                                                           is entered and immediately returns to
                                                           continue execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..
  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                         breakpoints. Causes wave to be halted
                                                         with the PC at the trap instruction.
                                                         The debugger is responsible to resume
                                                         the wave, including the instruction
                                                         that the breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                           ``queue_ptr`` the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
                                                           as a no-operation. The trap handler
                                                           is entered and immediately returns to
                                                           continue execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..
  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
                                                                            breakpoints. Causes wave to be halted
                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible to resume
                                                                            the wave, including the instruction
                                                                            that the breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
                                           ``queue_ptr``                    the trap instruction. The associated
                                                                            queue is signalled to put it into the
                                                                            error state. When the queue is put in
                                                                            the error state, the waves executing
                                                                            dispatches on the queue will be
                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
                                                                              as a no-operation. The trap handler
                                                                              is entered and immediately returns to
                                                                              continue execution of the wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is a work in
  progress that will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        as a by-value struct?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:

1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-GFX8: The M0 register is set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 hold the return address (RA), the code address that the function
   must return to when it completes. The value is undefined if the function is
   *no return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

   | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4 byte aligned for the ``r600``
   architecture and 16 byte aligned for the ``amdgcn`` architecture.

   .. note::

     The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
     OpenCL language, which has its largest base type defined as 16 bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack. Other stack-passed arguments are at positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack-passed argument
   for stack-allocated local variables and register spill slots. If necessary,
   the function may align these to an alignment greater than 16 bytes. After
   these, the function may dynamically allocate space for such things as
   runtime-sized ``alloca`` local allocations.

   If the function calls another function, it will place any stack-allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is
    available to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-GFX8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
   * VGPR56-63
   * VGPR72-79
   * VGPR88-95
   * VGPR104-111
   * VGPR120-127
   * VGPR136-143
   * VGPR152-159
   * VGPR168-175
   * VGPR184-191
   * VGPR200-207
   * VGPR216-223
   * VGPR232-239
   * VGPR248-255

     .. note::

       Except for the argument registers, the clobbered and preserved VGPRs
       are intermixed at regular intervals in order to keep a similar ratio
       independent of the number of allocated VGPRs.

   * Lanes of all VGPRs that are inactive at the call site.

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - On gfx908 are all ACC registers clobbered?

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location
to make a copy of the struct value and passes the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack
location to hold the result value and passes the address as the last input
argument (before the implicit input arguments). In this case there are no
result arguments. The called function is responsible for performing the
dereference when storing the result value. Clang terms this *structured return
(sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in
   registers, and pass as non-decomposed struct as stack argument? Or
   something else? Is the memory location in the caller stack frame, or a
   stack memory argument and so no address is passed as the caller can
   directly write to the argument stack location? But then the stack location
   is still live after return. If an argument stack location is used, is it
   the first stack argument or the last one?

Lambda argument types are treated as struct types with an implementation
defined set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the
decomposed struct type arguments) are passed in VGPRs unless marked ``inreg``,
in which case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

   Is recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only fields actually used by the function are set. The
   other bits are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to the Kernarg Segment Ptr to get
   the global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:

.. note::

   There are likely some errors and omissions in the following description
   that need correction.

   .. TODO::

      Check the Clang source code to decipher how function arguments and
      return results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?
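The work-item implicit argument layout described above can be sketched as
follows. This is an illustrative helper, not part of the ABI text or the
backend; it simply applies the bit positions from the layout table (bits 9:0
for X, 19:10 for Y, 29:20 for Z, with bits 31:30 undefined):

```python
# Illustrative sketch of the work-item implicit argument packing. The field
# positions come from the layout table above; the helper names are not part
# of any AMDGPU API.

def pack_workitem_ids(x: int, y: int, z: int) -> int:
    """Pack three 10-bit work-item IDs into a single 32-bit VGPR value."""
    assert 0 <= x < 1024 and 0 <= y < 1024 and 0 <= z < 1024
    return (z << 20) | (y << 10) | x

def unpack_workitem_ids(value: int) -> tuple:
    """Recover (x, y, z) from the packed value, ignoring undefined bits 31:30."""
    return (value & 0x3FF, (value >> 10) & 0x3FF, (value >> 20) & 0x3FF)
```

Note that, as the text says, only the fields actually used by the function are
set; a reader of the packed value must treat unused fields as undefined.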
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP, it is an
   unswizzled scratch address. It is only needed if runtime-sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

     CFI will be generated that defines the CFA as the unswizzled address
     relative to the wave scratch base in the unswizzled private address space
     of the lowest address stack allocated local variable.

     ``DW_AT_frame_base`` will be defined as the swizzled address in the
     swizzled private address space by dividing the CFA by the wavefront size
     (since the CFA is always at least dword aligned, which matches the
     scratch swizzle element size).

     If no dynamic stack alignment was performed, the stack allocated
     arguments are accessed as negative offsets relative to
     ``DW_AT_frame_base``, and the local variables and register spill slots
     are accessed as positive offsets relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill
   if necessary. These are copied back to physical registers at call sites.
   The net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

     The CFI will reflect the changed calculation needed to compute the CFA
     from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for
   an emergency spill slot. Buffer instructions are used for stack accesses
   and not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to the required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as address of locals or arguments. Difference is
     apparent when have implemented dynamic alignment.
   - If ``SCRATCH`` instruction could allow negative offsets, then can make
     FP be highest address of stack frame and use negative offset for locals.
     Would allow SP to be the same as FP and could support signal-handler-like
     as now have a real SP for the top of the stack.
   - How is ``sret`` passed on the stack? In argument stack area? Can it
     overlay arguments?
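The swizzled/unswizzled stack pointer relationship described for non-kernel
functions above can be sketched as follows. This is an illustrative sketch
only; the wavefront size and the flat scratch aperture base used in the
example are assumed values, not ABI constants:

```python
# Illustrative sketch of the SP conversions from the non-kernel function call
# convention: swizzled SP = unswizzled SP / wavefront size, and a private
# (scratch) address becomes a flat address by adding the flat scratch
# aperture base address. All concrete values here are examples.

def swizzled_sp(unswizzled_sp: int, wavefront_size: int = 64) -> int:
    """Convert an unswizzled SP (byte offset into the wavefront scratch
    backing memory) into the swizzled SP."""
    assert unswizzled_sp % wavefront_size == 0, "SP must be wave-size aligned"
    return unswizzled_sp // wavefront_size

def private_to_flat(private_addr: int, flat_scratch_aperture_base: int) -> int:
    """Convert a private address space address to a flat address by adding
    the flat scratch aperture base (a hypothetical value in the example)."""
    return flat_scratch_aperture_base + private_addr
```

A wave64 function with an unswizzled SP of 1024 bytes would therefore have a
swizzled SP of 16; the same arithmetic with a wave32 size of 32 gives 32.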
AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
from the application/runtime to each invocation of a hardware shader. These
parameters include both generic, application-controlled parameters called
*user data* as well as system-generated parameters that are a product of the
draw or dispatch execution.

User Data
~~~~~~~~~

Each hardware stage has a set of 32-bit *user data registers* which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way most
arguments are passed from the application/runtime to a hardware shader.

Compute User Data
~~~~~~~~~~~~~~~~~

Compute shader user data mappings are simpler than those of graphics shaders,
and have a fixed mapping.

Note that there are always 10 available *user data entries* in registers -
entries beyond that limit must be fetched from memory (via the spill table
pointer) by the shader.

  .. table:: PAL Compute Shader User Data Registers
     :name: pal-compute-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     1             Per-Shader Internal Table (32-bit pointer)
     2 - 11        Application-Controlled User Data (10 32-bit values)
     12            Spill Table (32-bit pointer)
     13 - 14       Thread Group Count (64-bit pointer)
     15            GDS Range
     ============= ================================

Graphics User Data
~~~~~~~~~~~~~~~~~~

Graphics pipelines support a much more flexible user data mapping:

  .. table:: PAL Graphics Shader User Data Registers
     :name: pal-graphics-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     +             Per-Shader Internal Table (32-bit pointer)
     + 1-15        Application Controlled User Data
                   (1-15 Contiguous 32-bit Values in Registers)
     +             Spill Table (32-bit pointer)
     +             Draw Index (First Stage Only)
     +             Vertex Offset (First Stage Only)
     +             Instance Offset (First Stage Only)
     ============= ================================

  The placement of the global internal table remains fixed in the first *user
  data SGPR register*. Otherwise all parameters are optional, and can be
  mapped to any desired *user data SGPR register*, with the following
  restrictions:

  * Draw Index, Vertex Offset, and Instance Offset can only be used by the
    first active hardware stage in a graphics pipeline (i.e. where the API
    vertex shader runs).

  * Application-controlled user data must be mapped into a contiguous range
    of user data registers.

  * The application-controlled user data range supports compaction remapping,
    so only *entries* that are actually consumed by the shader must be
    assigned to corresponding *registers*. Note that in order to support an
    efficient runtime implementation, the remapping must pack *registers* in
    the same order as *entries*, with unused *entries* removed.

.. _pal_global_internal_table:

Global Internal Table
~~~~~~~~~~~~~~~~~~~~~

The global internal table is a table of *shader resource descriptors* (SRDs)
that define how certain engine-wide, runtime-managed resources should be
accessed from a shader. The majority of these resources have HW-defined
formats, and it is up to the compiler to write/read data as required by the
target hardware.

The following table illustrates the required format:

  .. table:: PAL Global Internal Table
     :name: pal-git-table

     ============= ================================
     Offset        Description
     ============= ================================
     0-3           Graphics Scratch SRD
     4-7           Compute Scratch SRD
     8-11          ES/GS Ring Output SRD
     12-15         ES/GS Ring Input SRD
     16-19         GS/VS Ring Output #0
     20-23         GS/VS Ring Output #1
     24-27         GS/VS Ring Output #2
     28-31         GS/VS Ring Output #3
     32-35         GS/VS Ring Input SRD
     36-39         Tessellation Factor Buffer SRD
     40-43         Off-Chip LDS Buffer SRD
     44-47         Off-Chip Param Cache Buffer SRD
     48-51         Sample Position Buffer SRD
     52            vaRange::ShadowDescriptorTable High Bits
     ============= ================================

  The pointer to the global internal table passed to the shader as user data
  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
  the top 32 bits of the pipeline, so the shader may use the program
  counter's top 32 bits.

.. _pal_call-convention:

Call Convention
~~~~~~~~~~~~~~~

For graphics use cases, the calling convention is ``amdgpu_gfx``.

.. note::

  ``amdgpu_gfx`` function calls are currently in development and are
  subject to major changes.

This calling convention shares most properties with calling non-kernel
functions (see
:ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`).
Differences are:

  - Currently there are none; differences will be listed here.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included in
this description.

  =================================== =======================================
  Core ISA                            ISA Extensions
  =================================== =======================================
  :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
  :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                      :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                      :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                      :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

  :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                      :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
  =================================== =======================================

For more information about instructions, their semantics, and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_,
[AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.

Operands
~~~~~~~~

A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP,
VOP_SDWA), the assembler will automatically use the optimal encoding based on
the operands. To force a specific encoding, one can add a suffix to the opcode
of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.

.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:

Code Object V2 Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.option.machine_version_major
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_minor
+++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_stepping
++++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.kernel.vgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction, then the symbol value is updated to equal that VGPR number plus
one.

.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction, then the symbol value is updated to equal that SGPR number plus
one.

.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, one can specify them with assembler directives.
11511 11512.hsa_code_object_version major, minor 11513+++++++++++++++++++++++++++++++++++++ 11514 11515*major* and *minor* are integers that specify the version of the HSA code 11516object that will be generated by the assembler. 11517 11518.hsa_code_object_isa [major, minor, stepping, vendor, arch] 11519+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 11520 11521 11522*major*, *minor*, and *stepping* are all integers that describe the instruction 11523set architecture (ISA) version of the assembly program. 11524 11525*vendor* and *arch* are quoted strings. *vendor* should always be equal to 11526"AMD" and *arch* should always be equal to "AMDGPU". 11527 11528By default, the assembler will derive the ISA version, *vendor*, and *arch* 11529from the value of the -mcpu option that is passed to the assembler. 11530 11531.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel: 11532 11533.amdgpu_hsa_kernel (name) 11534+++++++++++++++++++++++++ 11535 11536This directives specifies that the symbol with given name is a kernel entry 11537point (label) and the object should contain corresponding symbol of type 11538STT_AMDGPU_HSA_KERNEL. 11539 11540.amd_kernel_code_t 11541++++++++++++++++++ 11542 11543This directive marks the beginning of a list of key / value pairs that are used 11544to specify the amd_kernel_code_t object that will be emitted by the assembler. 11545The list must be terminated by the *.end_amd_kernel_code_t* directive. For any 11546amd_kernel_code_t values that are unspecified a default value will be used. The 11547default value for all keys is 0, with the following exceptions: 11548 11549- *amd_code_version_major* defaults to 1. 11550- *amd_kernel_code_version_minor* defaults to 2. 11551- *amd_machine_kind* defaults to 1. 11552- *amd_machine_version_major*, *machine_version_minor*, and 11553 *amd_machine_version_stepping* are derived from the value of the -mcpu option 11554 that is passed to the assembler. 
11555- *kernel_code_entry_byte_offset* defaults to 256. 11556- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards 11557 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5. 11558 Note that wavefront size is specified as a power of two, so a value of **n** 11559 means a size of 2^ **n**. 11560- *call_convention* defaults to -1. 11561- *kernarg_segment_alignment*, *group_segment_alignment*, and 11562 *private_segment_alignment* default to 4. Note that alignments are specified 11563 as a power of 2, so a value of **n** means an alignment of 2^ **n**. 11564- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for 11565 GFX90A onwards. 11566- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for 11567 GFX10 onwards. 11568- *enable_mem_ordered* defaults to 1 for GFX10 onwards. 11569 11570The *.amd_kernel_code_t* directive must be placed immediately after the 11571function label and before any instructions. 11572 11573For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, 11574comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. 11575 11576.. _amdgpu-amdhsa-assembler-example-v2: 11577 11578Code Object V2 Example Source Code 11579~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11580 11581.. warning:: 11582 Code Object V2 is not the default code object version emitted by 11583 this version of LLVM. 11584 11585Here is an example of a minimal assembly source file, defining one HSA kernel: 11586 11587.. 
code:: 11588 :number-lines: 11589 11590 .hsa_code_object_version 1,0 11591 .hsa_code_object_isa 11592 11593 .hsatext 11594 .globl hello_world 11595 .p2align 8 11596 .amdgpu_hsa_kernel hello_world 11597 11598 hello_world: 11599 11600 .amd_kernel_code_t 11601 enable_sgpr_kernarg_segment_ptr = 1 11602 is_ptr64 = 1 11603 compute_pgm_rsrc1_vgprs = 0 11604 compute_pgm_rsrc1_sgprs = 0 11605 compute_pgm_rsrc2_user_sgpr = 2 11606 compute_pgm_rsrc1_wgp_mode = 0 11607 compute_pgm_rsrc1_mem_ordered = 0 11608 compute_pgm_rsrc1_fwd_progress = 1 11609 .end_amd_kernel_code_t 11610 11611 s_load_dwordx2 s[0:1], s[0:1] 0x0 11612 v_mov_b32 v0, 3.14159 11613 s_waitcnt lgkmcnt(0) 11614 v_mov_b32 v1, s0 11615 v_mov_b32 v2, s1 11616 flat_store_dword v[1:2], v0 11617 s_endpgm 11618 .Lfunc_end0: 11619 .size hello_world, .Lfunc_end0-hello_world 11620 11621.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4: 11622 11623Code Object V3 to V4 Predefined Symbols 11624~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11625 11626The AMDGPU assembler defines and updates some symbols automatically. These 11627symbols do not affect code generation. 11628 11629.amdgcn.gfx_generation_number 11630+++++++++++++++++++++++++++++ 11631 11632Set to the GFX major generation number of the target being assembled for. For 11633example, when assembling for a "GFX9" target this will be set to the integer 11634value "9". The possible GFX major generation numbers are presented in 11635:ref:`amdgpu-processors`. 11636 11637.amdgcn.gfx_generation_minor 11638++++++++++++++++++++++++++++ 11639 11640Set to the GFX minor generation number of the target being assembled for. For 11641example, when assembling for a "GFX810" target this will be set to the integer 11642value "1". The possible GFX minor generation numbers are presented in 11643:ref:`amdgpu-processors`. 11644 11645.amdgcn.gfx_generation_stepping 11646+++++++++++++++++++++++++++++++ 11647 11648Set to the GFX stepping generation number of the target being assembled for. 
11649For example, when assembling for a "GFX704" target this will be set to the 11650integer value "4". The possible GFX stepping generation numbers are presented 11651in :ref:`amdgpu-processors`. 11652 11653.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr: 11654 11655.amdgcn.next_free_vgpr 11656++++++++++++++++++++++ 11657 11658Set to zero before assembly begins. At each instruction, if the current value 11659of this symbol is less than or equal to the maximum VGPR number explicitly 11660referenced within that instruction then the symbol value is updated to equal 11661that VGPR number plus one. 11662 11663May be used to set the `.amdhsa_next_free_vgpr` directive in 11664:ref:`amdhsa-kernel-directives-table`. 11665 11666May be set at any time, e.g. manually set to zero at the start of each kernel. 11667 11668.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr: 11669 11670.amdgcn.next_free_sgpr 11671++++++++++++++++++++++ 11672 11673Set to zero before assembly begins. At each instruction, if the current value 11674of this symbol is less than or equal the maximum SGPR number explicitly 11675referenced within that instruction then the symbol value is updated to equal 11676that SGPR number plus one. 11677 11678May be used to set the `.amdhsa_next_free_spgr` directive in 11679:ref:`amdhsa-kernel-directives-table`. 11680 11681May be set at any time, e.g. manually set to zero at the start of each kernel. 11682 11683.. _amdgpu-amdhsa-assembler-directives-v3-v4: 11684 11685Code Object V3 to V4 Directives 11686~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11687 11688Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` 11689architecture processors, and are not OS-specific. Directives which begin with 11690``.amdhsa`` are specific to ``amdgcn`` architecture processors when the 11691``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and 11692:ref:`amdgpu-processors`. 11693 11694.. 
_amdgpu-assembler-directive-amdgcn-target: 11695 11696.amdgcn_target <target-triple> "-" <target-id> 11697++++++++++++++++++++++++++++++++++++++++++++++ 11698 11699Optional directive which declares the ``<target-triple>-<target-id>`` supported 11700by the containing assembler source file. Used by the assembler to validate 11701command-line options such as ``-triple``, ``-mcpu``, and 11702``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See 11703:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`. 11704 11705.. note:: 11706 11707 The target ID syntax used for code object V2 to V3 for this directive differs 11708 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`. 11709 11710.amdhsa_kernel <name> 11711+++++++++++++++++++++ 11712 11713Creates a correctly aligned AMDHSA kernel descriptor and a symbol, 11714``<name>.kd``, in the current location of the current section. Only valid when 11715the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first 11716instruction to execute, and does not need to be previously defined. 11717 11718Marks the beginning of a list of directives used to generate the bytes of a 11719kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. 11720Directives which may appear in this list are described in 11721:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must 11722be valid for the target being assembled for, and cannot be repeated. Directives 11723support the range of values specified by the field they reference in 11724:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is 11725assumed to have its default value, unless it is marked as "Required", in which 11726case it is an error to omit the directive. This list of directives is 11727terminated by an ``.end_amdhsa_kernel`` directive. 11728 11729 .. 
table:: AMDHSA Kernel Assembler Directives 11730 :name: amdhsa-kernel-directives-table 11731 11732 ======================================================== =================== ============ =================== 11733 Directive Default Supported On Description 11734 ======================================================== =================== ============ =================== 11735 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in 11736 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11737 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in 11738 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11739 ``.amdhsa_kernarg_size`` 0 GFX6-GFX10 Controls KERNARG_SIZE in 11740 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11741 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in 11742 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11743 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in 11744 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11745 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in 11746 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11747 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in 11748 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11749 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in 11750 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11751 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in 11752 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11753 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in 11754 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 
11755 ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in 11756 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11757 Specific 11758 (wavefrontsize64) 11759 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in 11760 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11761 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in 11762 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11763 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in 11764 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11765 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in 11766 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11767 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in 11768 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11769 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in 11770 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11771 Possible values are defined in 11772 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. 11773 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one. 11774 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in 11775 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11776 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one. 11777 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 11778 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11779 ``.amdhsa_accum_offset`` Required GFX90A Offset of a first AccVGPR in the unified register file. 11780 Used to calculate ACCUM_OFFSET in 11781 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`. 
11782 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR. 11783 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 11784 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11785 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access 11786 scratch memory. Used to calculate 11787 GRANULATED_WAVEFRONT_SGPR_COUNT in 11788 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11789 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. 11790 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 11791 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11792 (xnack) 11793 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in 11794 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11795 Possible values are defined in 11796 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 11797 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in 11798 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11799 Possible values are defined in 11800 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 11801 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in 11802 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11803 Possible values are defined in 11804 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 11805 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in 11806 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11807 Possible values are defined in 11808 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 11809 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in 11810 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 
11811 ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in 11812 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11813 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in 11814 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11815 ``.amdhsa_tg_split`` Target GFX90A Controls TG_SPLIT in 11816 Feature :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`. 11817 Specific 11818 (tgsplit) 11819 ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in 11820 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 11821 Specific 11822 (cumode) 11823 ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in 11824 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11825 ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in 11826 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 11827 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in 11828 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11829 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in 11830 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11831 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in 11832 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11833 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in 11834 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11835 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in 11836 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11837 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in 11838 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 
11839 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in 11840 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 11841 ======================================================== =================== ============ =================== 11842 11843.amdgpu_metadata 11844++++++++++++++++ 11845 11846Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` 11847note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`). 11848 11849The contents must be in the [YAML]_ markup format, with the same structure and 11850semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or 11851:ref:`amdgpu-amdhsa-code-object-metadata-v4`. 11852 11853This directive is terminated by an ``.end_amdgpu_metadata`` directive. 11854 11855.. _amdgpu-amdhsa-assembler-example-v3-v4: 11856 11857Code Object V3 to V4 Example Source Code 11858~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11859 11860Here is an example of a minimal assembly source file, defining one HSA kernel: 11861 11862.. 
code:: 11863 :number-lines: 11864 11865 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 11866 11867 .text 11868 .globl hello_world 11869 .p2align 8 11870 .type hello_world,@function 11871 hello_world: 11872 s_load_dwordx2 s[0:1], s[0:1] 0x0 11873 v_mov_b32 v0, 3.14159 11874 s_waitcnt lgkmcnt(0) 11875 v_mov_b32 v1, s0 11876 v_mov_b32 v2, s1 11877 flat_store_dword v[1:2], v0 11878 s_endpgm 11879 .Lfunc_end0: 11880 .size hello_world, .Lfunc_end0-hello_world 11881 11882 .rodata 11883 .p2align 6 11884 .amdhsa_kernel hello_world 11885 .amdhsa_user_sgpr_kernarg_segment_ptr 1 11886 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 11887 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 11888 .end_amdhsa_kernel 11889 11890 .amdgpu_metadata 11891 --- 11892 amdhsa.version: 11893 - 1 11894 - 0 11895 amdhsa.kernels: 11896 - .name: hello_world 11897 .symbol: hello_world.kd 11898 .kernarg_segment_size: 48 11899 .group_segment_fixed_size: 0 11900 .private_segment_fixed_size: 0 11901 .kernarg_segment_align: 4 11902 .wavefront_size: 64 11903 .sgpr_count: 2 11904 .vgpr_count: 3 11905 .max_flat_workgroup_size: 256 11906 ... 11907 .end_amdgpu_metadata 11908 11909If an assembly source file contains multiple kernels and/or functions, the 11910:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and 11911:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using 11912the ``.set <symbol>, <expression>`` directive. For example, in the case of two 11913kernels, where ``function1`` is only called from ``kernel1`` it is sufficient 11914to group the function with the kernel that calls it and reset the symbols 11915between the two connected components: 11916 11917.. code:: 11918 :number-lines: 11919 11920 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 11921 11922 // gpr tracking symbols are implicitly set to zero 11923 11924 .text 11925 .globl kern0 11926 .p2align 8 11927 .type kern0,@function 11928 kern0: 11929 // ... 
    s_endpgm
  .Lkern0_end:
    .size kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@hi+12
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

These symbols cannot identify connected components in order to automatically
track the usage for each kernel. However, in some cases careful organization of
the kernels and functions in the source file means there is minimal additional
effort required to accurately calculate GPR usage.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD "RDNA 1.0" Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD "RDNA 2" Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__