=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphic shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphic shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exceptions:

* ``amdhsa`` is not supported in the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).
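Putting the tables above together: a triple is simply the non-empty components joined with ``-``. The following Python sketch is purely illustrative (it is not an LLVM or Clang API); the component values are copied from the architecture, vendor, and OS tables in this section, and it also checks the ``amdhsa``/``r600`` exception just noted:

```python
# Illustrative sketch only: valid component values copied from the
# architecture/vendor/OS tables in this section; this is not an LLVM API.
ARCHS = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OSES = {"", "amdhsa", "amdpal", "mesa3d"}

def make_triple(arch, vendor="amd", os="amdhsa", env=""):
    """Join non-empty components into a -target style triple string."""
    if arch not in ARCHS or vendor not in VENDORS or os not in OSES:
        raise ValueError("unknown triple component")
    # Per the exception above, the amdhsa OS is not supported on r600.
    if arch == "r600" and os == "amdhsa":
        raise ValueError("amdhsa is not supported on the r600 architecture")
    return "-".join(c for c in (arch, vendor, os, env) if c)

print(make_triple("amdgcn"))  # amdgcn-amd-amdhsa
```

An empty OS or environment component is simply omitted here; Clang's own triple normalization renders missing components as ``unknown``.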
  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                       generic
                                                                       address
                                                                       space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                       generic
                                                                       address
                                                                       space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                       flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                       scratch       - *pal-amdpal*  - A8-7100
                                                                                                     - A8 Pro-7150B
                                                                                                     - A10-7300
                                                                                                     - A10 Pro-7350B
                                                                                                     - FX-7500
                                                                                                     - A8-7200P
                                                                                                     - A10-7400P
                                                                                                     - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                       flat          - *pal-amdhsa*  - FirePro W9100
                                                                       scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                     - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                       flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                       scratch       - *pal-amdpal*  - Radeon R390
                                                                                                     - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                       scratch                       - E1-2500
                                                                                                     - E2-3000
                                                                                                     - E2-3800
                                                                                                     - A4-5000
                                                                                                     - A4-5100
                                                                                                     - A6-5200
                                                                                                     - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                       flat          - *pal-amdpal*  - Radeon HD 8770
                                                                       scratch                       - R7 260
                                                                                                     - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                       flat          - *pal-amdpal*
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                       flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                       scratch       - *pal-amdpal*  - A8-8600P
                                                                                                     - Pro A8-8600B
                                                                                                     - FX-8800P
                                                                                                     - Pro A12-8800B
                                                                                                     - A10-8700P
                                                                                                     - Pro A10-8700B
                                                                                                     - A10-8780P
                                                                                                     - A10-9600P
                                                                                                     - A10-9630P
                                                                                                     - A12-9700P
                                                                                                     - A12-9730P
                                                                                                     - FX-9800P
                                                                                                     - FX-9830P
                                                                                                     - E2-9010
                                                                                                     - A6-9210
                                                                                                     - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                       scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                     - Radeon Pro Duo
                                                                                                     - FirePro S9300x2
                                                                                                     - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                       flat          - *pal-amdhsa*  - Radeon RX 480
                                                                       scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                       flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                       flat          - *pal-amdhsa*  - FirePro S7100
                                                                       scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                     - Mobile FirePro
                                                                                                       M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                       flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                       flat          - *pal-amdhsa*    Frontier Edition
                                                                       scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                     - Radeon RX Vega 64
                                                                                                     - Radeon RX Vega 64
                                                                                                       Liquid
                                                                                                     - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                       flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                       scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                       scratch       - *pal-amdpal*  - Radeon VII
                                                                                                     - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                       flat
                                                                       scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                       flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                      .. TODO::
                                                                      - Packed
                                                                       work-item                        Add product
                                                                       IDs                              names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                       flat                          - Ryzen 7 4700GE
                                                                       scratch                       - Ryzen 5 4600G
                                                                                                     - Ryzen 5 4600GE
                                                                                                     - Ryzen 3 4300G
                                                                                                     - Ryzen 3 4300GE
                                                                                                     - Ryzen Pro 4000G
                                                                                                     - Ryzen 7 Pro 4700G
                                                                                                     - Ryzen 7 Pro 4750GE
                                                                                                     - Ryzen 5 Pro 4650G
                                                                                                     - Ryzen 5 Pro 4650GE
                                                                                                     - Ryzen 3 Pro 4350G
                                                                                                     - Ryzen 3 Pro 4350GE

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                     - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                       scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                       scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature
  as optional components of the target ID. If omitted, the target feature has
  the ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.
For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         ``-m[no-]tgsplit``           Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 without
                                                  specifying XNACK replay. Executing code that was
                                                  generated with XNACK replay enabled, or generated
                                                  for code object V4 without specifying XNACK replay,
                                                  on a device that does not have XNACK replay
                                                  enabled will execute correctly but may be less
                                                  performant than code generated for XNACK replay
                                                  disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table`
  that is supported by the processor. The target features supported by each
  processor are specified in :ref:`amdgpu-processor-table`. Those that can be
  specified in a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a
  target ID.
  The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

Where a target feature is omitted if *Off* and present if *On* or *Any*.

.. note::

  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.
  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                             64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.

  The generic address space uses the hardware flat address support for two
  fixed ranges of virtual addresses (the private and local apertures), that
  are outside the range of addressable global memory, to map from a flat
  address to a private or local address. This uses FLAT instructions that can
  take a flat address and access global, private (scratch), and group (LDS)
  memory depending on whether the address is within one of the aperture
  ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
  Flat access to LDS requires hardware aperture setup and M0 (GFX7-GFX8)
  register setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).

  To convert between a private or group address space address (termed a
  segment address) and a flat address, the base address of the corresponding
  aperture can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained
  with the Queue Ptr SGPR (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For GFX9-GFX10 the
  aperture base addresses are directly available as inline constant registers
  ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit address
  mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32,
  which makes it easier to convert from flat to segment or segment to flat.

  A global address space address has the same value when used as a flat
  address, so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space is safe to
  assume to be a global pointer, since only the device global memory is
  visible and managed on the host side. The vector and scalar L1 caches are
  invalidated of volatile data before each kernel dispatch execution to allow
  constant memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS).
  All wavefronts executing on the same device will access the same memory for
  any given region address. However, the same region address accessed by
  wavefronts executing on different devices will access different memory. It
  is higher performance than global memory. It is allocated by the runtime.
  The data store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data
  store (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it
  when a wavefront terminates. The memory accessed by a lane of a wavefront
  for any given private address will be different to the memory accessed by
  another lane of the same or different wavefront for the same private
  address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from
  a pool of backing memory allocated by the runtime for each wavefront. The
  lanes of the wavefront access this using dword (4 byte) interleaving.
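The dword interleaving can be made concrete with a small Python sketch. The function below is purely illustrative (it is not an LLVM API); it follows the private-address-to-backing-memory mapping given in this section:

```python
def private_to_backing(wavefront_scratch_base, private_address,
                       wavefront_lane_id, wavefront_size):
    """Map a lane's private address to a backing memory address using
    dword (4 byte) interleaving, per the mapping in this section."""
    return (wavefront_scratch_base
            + (private_address // 4) * wavefront_size * 4
            + wavefront_lane_id * 4
            + private_address % 4)

# Lanes reading the same private address touch adjacent dwords, so the
# accesses coalesce into few cache lines:
addrs = [private_to_backing(0x1000, 8, lane, 64) for lane in range(4)]
print([hex(a) for a in addrs])  # ['0x1200', '0x1204', '0x1208', '0x120c']
```

Note how the low two bits of the private address (the byte within a dword) pass straight through, while the dword index is scaled by the wavefront size to interleave the lanes.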
  The mapping used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.

  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX10.

**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is in
  the future intended to support the modelling of 128-bit buffer descriptors
  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
  model the buffer descriptors used heavily in graphics workloads targeting
  the backend.

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section provides LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_.
The happens-before relation is transitive over the synchronizes-with relation
independent of scope and synchronizes-with allows the memory scope instances
to be inclusive (see table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different to the OpenCL [OpenCL]_ memory model which does not have
scope inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ======================= ===================================================
     LLVM Sync Scope         Description
     ======================= ===================================================
     *none*                  The default: ``system``.

                             Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``.
                             - ``agent`` and executed by a thread on the same
                               agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``agent``               Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system`` or ``agent`` and executed by a thread
                               on the same agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.
     ``workgroup``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent`` or ``workgroup`` and
                               executed by a thread in the same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``wavefront``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent``, ``workgroup`` or
                               ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``singlethread``        Only synchronizes with and participates in
                             modification and seq_cst total orderings with,
                             other operations (except image operations) running
                             in the same thread for all address spaces (for
                             example, in signal handlers).

     ``one-as``              Same as ``system`` but only synchronizes with other
                             operations within the same address space.

     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
                             operations within the same address space.

     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
                             other operations within the same address space.

     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
                             other operations within the same address space.

     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
                             other operations within the same address space.
     ======================= ===================================================

LLVM IR Intrinsics
------------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

LLVM IR Attributes
------------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
                                             The implied default value is 1,1024.

     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
                                             and the backend may not be able to satisfy the request. If
                                             the specified range is incompatible with the function's
                                             "amdgpu-flat-work-group-size" value, the occupancy bounds
                                             implied by the workgroup size take precedence.

     "amdgpu-ieee"                           true/false. Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default for
                                             the calling convention.
881 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of 882 the mode register to be set on entry. Overrides the default 883 for the calling convention. 884 885 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the 886 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this 887 attribute, or reached through a call site marked with this attribute, 888 the value returned by the intrinsic is undefined. The backend can 889 generally infer this during code generation, so typically there is no 890 benefit to frontends marking functions with this. 891 892 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the 893 llvm.amdgcn.workitem.id.y intrinsic. 894 895 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the 896 llvm.amdgcn.workitem.id.z intrinsic. 897 898 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the 899 llvm.amdgcn.workgroup.id.x intrinsic. 900 901 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the 902 llvm.amdgcn.workgroup.id.y intrinsic. 903 904 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the 905 llvm.amdgcn.workgroup.id.z intrinsic. 906 907 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the 908 llvm.amdgcn.dispatch.ptr intrinsic. 909 910 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the 911 llvm.amdgcn.implicitarg.ptr intrinsic. 912 913 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the 914 llvm.amdgcn.dispatch.id intrinsic. 915 916 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the 917 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint 918 attributes, the queue pointer may be required in situations where the 919 intrinsic call does not directly appear in the program. 
Some subtargets
                                             require the queue pointer to handle some addrspacecasts, as
                                             well as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private,
                                             llvm.trap, and llvm.debugtrap intrinsics.

     ======================================= ==========================================================

.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v4`
     ========================== ===============================

..

  .. 
table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    process address space applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which
  the code object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
    runtime ABI for code object V2. Specify using the Clang option
    ``-mcode-object-version=2``.

  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
    runtime ABI for code object V3. Specify using the Clang option
    ``-mcode-object-version=3``. This is the default code object
    version if not specified.

  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
    runtime ABI for code object V4. 
Specify using the Clang option
    ``-mcode-object-version=4``.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMDGPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
  ``e_flags`` for code object V3 to V4 (see
  :ref:`amdgpu-elf-header-e_flags-table-v3` and
  :ref:`amdgpu-elf-header-e_flags-table-v4`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
     :name: amdgpu-elf-header-e_flags-v2-table

     ===================================== ===== =============================
     Name                                  Value Description
     ===================================== ===== =============================
     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
                                                 target feature is
                                                 enabled for all code
                                                 contained in the code object.
                                                 If the processor
                                                 does not support the
                                                 ``xnack`` target
                                                 feature then must
                                                 be 0.
1079 See 1080 :ref:`amdgpu-target-features`. 1081 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap 1082 handler is enabled for all 1083 code contained in the code 1084 object. If the processor 1085 does not support a trap 1086 handler then must be 0. 1087 See 1088 :ref:`amdgpu-target-features`. 1089 ===================================== ===== ============================= 1090 1091 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3 1092 :name: amdgpu-elf-header-e_flags-table-v3 1093 1094 ================================= ===== ============================= 1095 Name Value Description 1096 ================================= ===== ============================= 1097 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection 1098 mask for 1099 ``EF_AMDGPU_MACH_xxx`` values 1100 defined in 1101 :ref:`amdgpu-ef-amdgpu-mach-table`. 1102 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack`` 1103 target feature is 1104 enabled for all code 1105 contained in the code object. 1106 If the processor 1107 does not support the 1108 ``xnack`` target 1109 feature then must 1110 be 0. 1111 See 1112 :ref:`amdgpu-target-features`. 1113 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc`` 1114 target feature is 1115 enabled for all code 1116 contained in the code object. 1117 If the processor 1118 does not support the 1119 ``sramecc`` target 1120 feature then must 1121 be 0. 1122 See 1123 :ref:`amdgpu-target-features`. 1124 ================================= ===== ============================= 1125 1126 .. 
table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
     :name: amdgpu-elf-header-e_flags-table-v4

     ============================================ ===== ===================================
     Name                                         Value Description
     ============================================ ===== ===================================
     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
                                                        mask for
                                                        ``EF_AMDGPU_MACH_xxx`` values
                                                        defined in
                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
     ============================================ ===== ===================================

  .. 
table:: AMDGPU ``EF_AMDGPU_MACH`` Values 1154 :name: amdgpu-ef-amdgpu-mach-table 1155 1156 ==================================== ========== ============================= 1157 Name Value Description (see 1158 :ref:`amdgpu-processor-table`) 1159 ==================================== ========== ============================= 1160 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* 1161 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` 1162 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` 1163 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` 1164 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` 1165 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` 1166 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` 1167 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` 1168 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` 1169 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` 1170 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` 1171 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` 1172 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` 1173 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` 1174 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` 1175 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` 1176 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` 1177 *reserved* 0x011 - Reserved for ``r600`` 1178 0x01f architecture processors. 1179 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` 1180 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` 1181 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` 1182 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` 1183 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` 1184 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` 1185 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` 1186 *reserved* 0x027 Reserved. 
1187 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` 1188 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` 1189 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` 1190 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` 1191 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` 1192 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` 1193 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` 1194 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` 1195 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908`` 1196 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` 1197 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c`` 1198 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` 1199 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` 1200 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` 1201 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030`` 1202 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031`` 1203 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032`` 1204 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033`` 1205 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602`` 1206 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705`` 1207 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805`` 1208 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035`` 1209 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034`` 1210 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a`` 1211 *reserved* 0x040 Reserved. 1212 *reserved* 0x041 Reserved. 1213 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013`` 1214 *reserved* 0x043 Reserved. 1215 *reserved* 0x044 Reserved. 1216 *reserved* 0x045 Reserved. 1217 ==================================== ========== ============================= 1218 1219Sections 1220-------- 1221 1222An AMDGPU target ELF code object has the standard ELF sections which include: 1223 1224 .. 
table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call. Generated
  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3-v4`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
alignment.

.. _amdgpu-note-records-v2:

Code Object V2 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V2.

The note record vendor field is "AMD".

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. 
table:: AMDGPU Code Object V2 ELF Note Records 1313 :name: amdgpu-elf-note-records-v2-table 1314 1315 ===== ===================================== ====================================== 1316 Name Type Description 1317 ===== ===================================== ====================================== 1318 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version. 1319 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL 1320 Finalizer and not the LLVM compiler. 1321 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version. 1322 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in 1323 YAML [YAML]_ textual format. 1324 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name. 1325 ===== ===================================== ====================================== 1326 1327.. 1328 1329 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values 1330 :name: amdgpu-elf-note-record-enumeration-values-v2-table 1331 1332 ===================================== ===== 1333 Name Value 1334 ===================================== ===== 1335 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1 1336 ``NT_AMD_HSA_HSAIL`` 2 1337 ``NT_AMD_HSA_ISA_VERSION`` 3 1338 *reserved* 4-9 1339 ``NT_AMD_HSA_METADATA`` 10 1340 ``NT_AMD_HSA_ISA_NAME`` 11 1341 ===================================== ===== 1342 1343``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1344 Specifies the code object version number. The description field has the 1345 following layout: 1346 1347 .. code:: c 1348 1349 struct amdgpu_hsa_note_code_object_version_s { 1350 uint32_t major_version; 1351 uint32_t minor_version; 1352 }; 1353 1354 The ``major_version`` has a value less than or equal to 2. 1355 1356``NT_AMD_HSA_HSAIL`` 1357 Specifies the HSAIL properties used by the HSAIL Finalizer. The description 1358 field has the following layout: 1359 1360 .. 
code:: c

    struct amdgpu_hsa_note_hsail_s {
      uint32_t hsail_major_version;
      uint32_t hsail_minor_version;
      uint8_t profile;
      uint8_t machine_model;
      uint8_t default_float_round;
    };

``NT_AMD_HSA_ISA_VERSION``
  Specifies the target ISA version. The description field has the following
  layout:

  .. code:: c

    struct amdgpu_hsa_note_isa_s {
      uint16_t vendor_name_size;
      uint16_t architecture_name_size;
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
      char vendor_and_architecture_name[1];
    };

  ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.

  This note record is used by the HSA runtime loader.

  Code object V2 only supports a limited number of processors and has fixed
  settings for target features. See
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
  processors and the corresponding target ID. In the table the note record ISA
  name is a concatenation of the vendor name, architecture name, major, minor,
  and stepping separated by a ":".

  The target ID column shows the processor name and fixed target features used
  by the LLVM compiler. The LLVM compiler does not generate a
  ``NT_AMD_HSA_HSAIL`` note record.

  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
  bit.
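A consumer of this note record has to recover the two concatenated strings
before it can build the ``vendor:architecture:major:minor:stepping`` name used
in the table below. The following sketch shows one way to do that; the helper
name ``format_isa_name`` is illustrative only and is not part of any AMD or
LLVM API:

```c
#include <stdint.h>
#include <stdio.h>

/* Layout of the NT_AMD_HSA_ISA_VERSION description field, as above. */
struct amdgpu_hsa_note_isa_s {
  uint16_t vendor_name_size;       /* includes the terminating NUL */
  uint16_t architecture_name_size; /* includes the terminating NUL */
  uint32_t major;
  uint32_t minor;
  uint32_t stepping;
  char vendor_and_architecture_name[1];
};

/* Format "vendor:architecture:major:minor:stepping", e.g. "AMD:AMDGPU:9:0:6".
   Returns the number of characters snprintf produced. */
static int format_isa_name(const struct amdgpu_hsa_note_isa_s *note,
                           char *buf, size_t buf_size) {
  const char *vendor = note->vendor_and_architecture_name;
  /* The architecture string starts right after the vendor's NUL. */
  const char *arch = vendor + note->vendor_name_size;
  return snprintf(buf, buf_size, "%s:%s:%u:%u:%u", vendor, arch,
                  note->major, note->minor, note->stepping);
}
```

The only subtle point is that ``vendor_name_size`` counts the NUL, so adding it
to the start of the payload lands exactly on the architecture string.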

``NT_AMD_HSA_ISA_NAME``
  Specifies the target ISA name as a non-NUL terminated string.

  This note record is not used by the HSA runtime loader.

  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
  V2's limited support of processors and fixed settings for target features.

  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
  from the string to the corresponding target ID. If the ``xnack`` target
  feature is supported and enabled, the string produced by the LLVM compiler
  may have a ``+xnack`` appended. The Finalizer did not do the appending and
  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.

``NT_AMD_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on HSA
  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
  metadata string.

  .. 
table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings 1433 :name: amdgpu-elf-note-record-supported_processors-v2-table 1434 1435 ===================== ========================== 1436 Note Record ISA Name Target ID 1437 ===================== ========================== 1438 ``AMD:AMDGPU:6:0:0`` ``gfx600`` 1439 ``AMD:AMDGPU:6:0:1`` ``gfx601`` 1440 ``AMD:AMDGPU:6:0:2`` ``gfx602`` 1441 ``AMD:AMDGPU:7:0:0`` ``gfx700`` 1442 ``AMD:AMDGPU:7:0:1`` ``gfx701`` 1443 ``AMD:AMDGPU:7:0:2`` ``gfx702`` 1444 ``AMD:AMDGPU:7:0:3`` ``gfx703`` 1445 ``AMD:AMDGPU:7:0:4`` ``gfx704`` 1446 ``AMD:AMDGPU:7:0:5`` ``gfx705`` 1447 ``AMD:AMDGPU:8:0:0`` ``gfx802`` 1448 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+`` 1449 ``AMD:AMDGPU:8:0:2`` ``gfx802`` 1450 ``AMD:AMDGPU:8:0:3`` ``gfx803`` 1451 ``AMD:AMDGPU:8:0:4`` ``gfx803`` 1452 ``AMD:AMDGPU:8:0:5`` ``gfx805`` 1453 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+`` 1454 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-`` 1455 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+`` 1456 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-`` 1457 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+`` 1458 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-`` 1459 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+`` 1460 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-`` 1461 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+`` 1462 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-`` 1463 ===================== ========================== 1464 1465.. _amdgpu-note-records-v3-v4: 1466 1467Code Object V3 to V4 Note Records 1468~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1469 1470The AMDGPU backend code object uses the following ELF note record in the 1471``.note`` section when compiling for code object V3 to V4. 1472 1473The note record vendor field is "AMDGPU". 1474 1475Additional note records may be present, but any which are not documented here 1476are deprecated and should not be used. 1477 1478 .. 
table:: AMDGPU Code Object V3 to V4 ELF Note Records 1479 :name: amdgpu-elf-note-records-table-v3-v4 1480 1481 ======== ============================== ====================================== 1482 Name Type Description 1483 ======== ============================== ====================================== 1484 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 1485 binary format. 1486 ======== ============================== ====================================== 1487 1488.. 1489 1490 .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values 1491 :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4 1492 1493 ============================== ===== 1494 Name Value 1495 ============================== ===== 1496 *reserved* 0-31 1497 ``NT_AMDGPU_METADATA`` 32 1498 ============================== ===== 1499 1500``NT_AMDGPU_METADATA`` 1501 Specifies extensible metadata associated with an AMDGPU code object. It is 1502 encoded as a map in the Message Pack [MsgPack]_ binary data format. See 1503 :ref:`amdgpu-amdhsa-code-object-metadata-v3` and 1504 :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the 1505 ``amdhsa`` OS. 1506 1507.. _amdgpu-symbols: 1508 1509Symbols 1510------- 1511 1512Symbols include the following: 1513 1514 .. 
table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ================== ================ ==================
     Name                  Type               Section          Description
     ===================== ================== ================ ==================
     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
                                              - ``.rodata``
                                              - ``.bss``
     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
     ===================== ================== ================ ==================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  readonly.

  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor specific section name ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.

  .. TODO::

     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of the
  kernel descriptor that is used in the AQL dispatch packet used to invoke the
  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.

.. 
_amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word16``
  This specifies a 16-bit field occupying 2 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:

  .. 
table:: AMDGPU ELF Relocation Records 1604 :name: amdgpu-elf-relocation-records-table 1605 1606 ========================== ======= ===== ========== ============================== 1607 Relocation Type Kind Value Field Calculation 1608 ========================== ======= ===== ========== ============================== 1609 ``R_AMDGPU_NONE`` 0 *none* *none* 1610 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF 1611 Dynamic 1612 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32 1613 Dynamic 1614 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A 1615 Dynamic 1616 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P 1617 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P 1618 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A 1619 Dynamic 1620 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P 1621 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF 1622 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32 1623 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF 1624 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32 1625 *reserved* 12 1626 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A 1627 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4 1628 ========================== ======= ===== ========== ============================== 1629 1630``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by 1631the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. 1632 1633There is no current OS loader support for 32-bit programs and so 1634``R_AMDGPU_ABS32`` is not used. 1635 1636.. _amdgpu-loaded-code-object-path-uniform-resource-identifier: 1637 1638Loaded Code Object Path Uniform Resource Identifier (URI) 1639--------------------------------------------------------- 1640 1641The AMD GPU code object loader represents the path of the ELF shared object from 1642which the code object was loaded as a textual Uniform Resource Identifier (URI). 
Note that the code object is the in-memory loaded and relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
  and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%". Directories in
  the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000

.. 
_amdgpu-dwarf-debug-information: 1691 1692DWARF Debug Information 1693======================= 1694 1695.. warning:: 1696 1697 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that 1698 is not currently fully implemented and is subject to change. 1699 1700AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see 1701:ref:`amdgpu-elf-code-object`) which contain information that maps the code 1702object executable code and data to the source language constructs. It can be 1703used by tools such as debuggers and profilers. It uses features defined in 1704:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in 1705DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension. 1706 1707This section defines the AMDGPU target architecture specific DWARF mappings. 1708 1709.. _amdgpu-dwarf-register-identifier: 1710 1711Register Identifier 1712------------------- 1713 1714This section defines the AMDGPU target architecture register numbers used in 1715DWARF operation expressions (see DWARF Version 5 section 2.5 and 1716:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information 1717instructions (see DWARF Version 5 section 6.4 and 1718:ref:`amdgpu-dwarf-call-frame-information`). 1719 1720A single code object can contain code for kernels that have different wavefront 1721sizes. The vector registers and some scalar registers are based on the wavefront 1722size. AMDGPU defines distinct DWARF registers for each wavefront size. This 1723simplifies the consumer of the DWARF so that each register has a fixed size, 1724rather than being dynamic according to the wavefront size mode. Similarly, 1725distinct DWARF registers are defined for those registers that vary in size 1726according to the process address size. This allows a consumer to treat a 1727specific AMDGPU processor as a single architecture regardless of how it is 1728configured at run time. 
The compiler explicitly specifies the DWARF registers 1729that match the mode in which the code it is generating will be executed. 1730 1731DWARF registers are encoded as numbers, which are mapped to architecture 1732registers. The mapping for AMDGPU is defined in 1733:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same 1734mapping. 1735 1736.. table:: AMDGPU DWARF Register Mapping 1737 :name: amdgpu-dwarf-register-mapping-table 1738 1739 ============== ================= ======== ================================== 1740 DWARF Register AMDGPU Register Bit Size Description 1741 ============== ================= ======== ================================== 1742 0 PC_32 32 Program Counter (PC) when 1743 executing in a 32-bit process 1744 address space. Used in the CFI to 1745 describe the PC of the calling 1746 frame. 1747 1 EXEC_MASK_32 32 Execution Mask Register when 1748 executing in wavefront 32 mode. 1749 2-15 *Reserved* *Reserved for highly accessed 1750 registers using DWARF shortcut.* 1751 16 PC_64 64 Program Counter (PC) when 1752 executing in a 64-bit process 1753 address space. Used in the CFI to 1754 describe the PC of the calling 1755 frame. 1756 17 EXEC_MASK_64 64 Execution Mask Register when 1757 executing in wavefront 64 mode. 1758 18-31 *Reserved* *Reserved for highly accessed 1759 registers using DWARF shortcut.* 1760 32-95 SGPR0-SGPR63 32 Scalar General Purpose 1761 Registers. 1762 96-127 *Reserved* *Reserved for frequently accessed 1763 registers using DWARF 1-byte ULEB.* 1764 128 STATUS 32 Status Register. 1765 129-511 *Reserved* *Reserved for future Scalar 1766 Architectural Registers.* 1767 512 VCC_32 32 Vector Condition Code Register 1768 when executing in wavefront 32 1769 mode. 1770 513-767 *Reserved* *Reserved for future Vector 1771 Architectural Registers when 1772 executing in wavefront 32 mode.* 1773 768 VCC_64 64 Vector Condition Code Register 1774 when executing in wavefront 64 1775 mode. 
1776 769-1023 *Reserved* *Reserved for future Vector 1777 Architectural Registers when 1778 executing in wavefront 64 mode.* 1779 1024-1087 *Reserved* *Reserved for padding.* 1780 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers. 1781 1130-1535 *Reserved* *Reserved for future Scalar 1782 General Purpose Registers.* 1783 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers 1784 when executing in wavefront 32 1785 mode. 1786 1792-2047 *Reserved* *Reserved for future Vector 1787 General Purpose Registers when 1788 executing in wavefront 32 mode.* 1789 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers 1790 when executing in wavefront 32 1791 mode. 1792 2304-2559 *Reserved* *Reserved for future Vector 1793 Accumulation Registers when 1794 executing in wavefront 32 mode.* 1795 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers 1796 when executing in wavefront 64 1797 mode. 1798 2816-3071 *Reserved* *Reserved for future Vector 1799 General Purpose Registers when 1800 executing in wavefront 64 mode.* 1801 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers 1802 when executing in wavefront 64 1803 mode. 1804 3328-3583 *Reserved* *Reserved for future Vector 1805 Accumulation Registers when 1806 executing in wavefront 64 mode.* 1807 ============== ================= ======== ================================== 1808 1809The vector registers are represented as the full size for the wavefront. They 1810are organized as consecutive dwords (32-bits), one per lane, with the dword at 1811the least significant bit position corresponding to lane 0 and so forth. DWARF 1812location expressions involving the ``DW_OP_LLVM_offset`` and 1813``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector 1814register corresponding to the lane that is executing the current thread of 1815execution in languages that are implemented using a SIMD or SIMT execution 1816model. 
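As an illustration of this numbering, the following hypothetical Python sketch
(not part of LLVM; the helper names are invented here) derives the DWARF
register number for a VGPR and the bit offset of a lane's dword from the
mapping table above:

```python
# Hypothetical helpers, not LLVM code: derive AMDGPU DWARF register
# numbers for VGPRs and per-lane dword offsets from the mapping table
# above (wave32 VGPR0-VGPR255 start at 1536, wave64 at 2560).

def vgpr_dwarf_register(n: int, wavefront_size: int) -> int:
    """DWARF register number for VGPRn in the given wavefront mode."""
    assert 0 <= n <= 255 and wavefront_size in (32, 64)
    base = 1536 if wavefront_size == 32 else 2560
    return base + n

def lane_dword_bit_offset(lane: int) -> int:
    """Bit offset of a lane's dword within the full vector register.

    Lane 0 occupies the least significant 32 bits, lane 1 the next 32
    bits, and so forth, as described below the table.
    """
    return lane * 32

print(vgpr_dwarf_register(5, 64))  # 2565
print(lane_dword_bit_offset(7))    # 224
```

Expressed as a byte offset, the same lane's dword starts ``lane * 4`` bytes
into the register, which is the form a ``DW_OP_LLVM_offset`` operation would
use.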
1817 1818If the wavefront size is 32 lanes then the wavefront 32 mode register 1819definitions are used. If the wavefront size is 64 lanes then the wavefront 64 1820mode register definitions are used. Some AMDGPU targets support executing in 1821both wavefront 32 and wavefront 64 mode. The register definitions corresponding 1822to the wavefront mode of the generated code will be used. 1823 1824If code is generated to execute in a 32-bit process address space, then the 182532-bit process address space register definitions are used. If code is generated 1826to execute in a 64-bit process address space, then the 64-bit process address 1827space register definitions are used. The ``amdgcn`` target only supports the 182864-bit process address space. 1829 1830.. _amdgpu-dwarf-address-class-identifier: 1831 1832Address Class Identifier 1833------------------------ 1834 1835The DWARF address class represents the source language memory space. See DWARF 1836Version 5 section 2.12 which is updated by the *DWARF Extensions For 1837Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`. 1838 1839The DWARF address class mapping used for AMDGPU is defined in 1840:ref:`amdgpu-dwarf-address-class-mapping-table`. 1841 1842.. 
table:: AMDGPU DWARF Address Class Mapping 1843 :name: amdgpu-dwarf-address-class-mapping-table 1844 1845 ========================= ====== ================= 1846 DWARF AMDGPU 1847 -------------------------------- ----------------- 1848 Address Class Name Value Address Space 1849 ========================= ====== ================= 1850 ``DW_ADDR_none`` 0x0000 Generic (Flat) 1851 ``DW_ADDR_LLVM_global`` 0x0001 Global 1852 ``DW_ADDR_LLVM_constant`` 0x0002 Global 1853 ``DW_ADDR_LLVM_group`` 0x0003 Local (group/LDS) 1854 ``DW_ADDR_LLVM_private`` 0x0004 Private (Scratch) 1855 ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS) 1856 ========================= ====== ================= 1857 1858The DWARF address class values defined in the *DWARF Extensions For 1859Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used. 1860 1861In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is 1862available for use for the AMD extension for access to the hardware GDS memory 1863which is scratchpad memory allocated per device. 1864 1865For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default 1866address class of ``DW_ADDR_none`` is used. 1867 1868See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU 1869mapping of DWARF address classes to DWARF address spaces, including address size 1870and NULL value. 1871 1872.. _amdgpu-dwarf-address-space-identifier: 1873 1874Address Space Identifier 1875------------------------ 1876 1877DWARF address spaces correspond to target architecture specific linear 1878addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions 1879For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`. 1880 1881The DWARF address space mapping used for AMDGPU is defined in 1882:ref:`amdgpu-dwarf-address-space-mapping-table`. 1883 1884.. 
table:: AMDGPU DWARF Address Space Mapping 1885 :name: amdgpu-dwarf-address-space-mapping-table 1886 1887 ======================================= ===== ======= ======== ================= ======================= 1888 DWARF AMDGPU Notes 1889 --------------------------------------- ----- ---------------- ----------------- ----------------------- 1890 Address Space Name Value Address Bit Size Address Space 1891 --------------------------------------- ----- ------- -------- ----------------- ----------------------- 1892 .. 64-bit 32-bit 1893 process process 1894 address address 1895 space space 1896 ======================================= ===== ======= ======== ================= ======================= 1897 ``DW_ASPACE_none`` 0x00 64 32 Global *default address space* 1898 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat) 1899 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS) 1900 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS) 1901 *Reserved* 0x04 1902 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane* 1903 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront* 1904 ======================================= ===== ======= ======== ================= ======================= 1905 1906See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces 1907including address size and NULL value. 1908 1909The ``DW_ASPACE_none`` address space is the default target architecture address 1910space used in DWARF operations that do not specify an address space. It 1911therefore has to map to the global address space so that the ``DW_OP_addr*`` and 1912related operations can refer to addresses in the program code. 1913 1914The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to 1915specify the flat address space. If the address corresponds to an address in the 1916local address space, then it corresponds to the wavefront that is executing the 1917focused thread of execution. 
If the address corresponds to an address in the 1918private address space, then it corresponds to the lane that is executing the 1919focused thread of execution for languages that are implemented using a SIMD or 1920SIMT execution model. 1921 1922.. note:: 1923 1924 CUDA-like languages such as HIP that do not have address spaces in the 1925 language type system, but do allow variables to be allocated in different 1926 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic`` 1927 address space in the DWARF expression operations as the default address space 1928 is the global address space. 1929 1930The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to 1931specify the local address space corresponding to the wavefront that is executing 1932the focused thread of execution. 1933 1934The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions 1935to specify the private address space corresponding to the lane that is executing 1936the focused thread of execution for languages that are implemented using a SIMD 1937or SIMT execution model. 1938 1939The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions 1940to specify the unswizzled private address space corresponding to the wavefront 1941that is executing the focused thread of execution. The wavefront view of private 1942memory is the per wavefront unswizzled backing memory layout defined in 1943:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first 1944location for the backing memory of the wavefront (namely the address is not 1945offset by ``wavefront-scratch-base``). 
The following formula can be used to
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, and onto which a source language maps
its threads of execution. The DWARF lane identifier is pushed by the
``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.
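For illustration, the private-lane to private-wave address conversion given
earlier can be transcribed directly; this is a hypothetical Python sketch (not
part of any toolchain), where ``wavefront_lane_id`` is the lane identifier
described in this section:

```python
# Hypothetical sketch, not LLVM code: direct transcription of the
# DW_ASPACE_AMDGPU_private_lane to DW_ASPACE_AMDGPU_private_wave
# conversion formula given earlier in this chapter.

def private_lane_to_wave(private_address_lane: int,
                         wavefront_lane_id: int,
                         wavefront_size: int) -> int:
    return (((private_address_lane // 4) * wavefront_size * 4)
            + (wavefront_lane_id * 4)
            + (private_address_lane % 4))

# For a dword-aligned lane address and lane 0, this reduces to
# private_address_lane * wavefront_size, matching the simplified form:
assert private_lane_to_wave(8, 0, 64) == 8 * 64
```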
1986 1987For AMDGPU, the lane identifier corresponds to the hardware lane ID of a 1988wavefront. It is numbered from 0 to the wavefront size minus 1. 1989 1990Operation Expressions 1991--------------------- 1992 1993DWARF expressions are used to compute program values and the locations of 1994program objects. See DWARF Version 5 section 2.5 and 1995:ref:`amdgpu-dwarf-operation-expressions`. 1996 1997DWARF location descriptions describe how to access storage which includes memory 1998and registers. When accessing storage on AMDGPU, bytes are ordered with least 1999significant bytes first, and bits are ordered within bytes with least 2000significant bits first. 2001 2002For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe 2003unwinding vector registers that are spilled under the execution mask to memory: 2004the zero-single location description is the vector register, and the one-single 2005location description is the spilled memory location description. The 2006``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the 2007memory location description. 2008 2009In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the 2010``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is 2011controlled by the execution mask. An undefined location description together 2012with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry 2013to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example. 2014 2015Debugger Information Entry Attributes 2016------------------------------------- 2017 2018This section describes how certain debugger information entry attributes are 2019used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1 2020which are updated by *DWARF Extensions For Heterogeneous Debugging* section 2021:ref:`amdgpu-dwarf-low-level-information` and 2022:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`. 2023 2024.. 
_amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and ANDing it with the
saved ``EXEC`` mask from the start of the region. After the ``IF/THEN/ELSE``
region the ``EXEC`` mask is restored to the value it had at the beginning of the
region. This is shown below. Other approaches are possible, but the basic
concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
    b;
    %3 = EXEC
    %4 = c2
  $lex_1_1_start:
    EXEC = %3 & %4
  $lex_1_1_then:
    c;
    EXEC = ~EXEC & %3
  $lex_1_1_else:
    d;
    EXEC = %3
  $lex_1_1_end:
    e;
    EXEC = ~EXEC & %1
  $lex_1_else:
    f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This can
be done by defining an artificial variable for the lane PC. The DWARF location
list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.

A DWARF procedure is defined for each well nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
conceptually inherits the value of the immediately enclosing region and modifies
it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
the region for the ``THEN`` region since it is executed first.
For the ``ELSE`` 2130region the divergent program location is at the end of the ``IF/THEN/ELSE`` 2131region since the ``THEN`` region has completed. 2132 2133The lane PC artificial variable is assigned at each region transition. It uses 2134the immediately enclosing region's DWARF procedure to compute the program 2135location for each lane assuming they are divergent, and then modifies the result 2136by inserting the current program location for each lane that the ``EXEC`` mask 2137indicates is active. 2138 2139By having separate DWARF procedures for each region, they can be reused to 2140define the value for any nested region. This reduces the total size of the DWARF 2141operation expressions. 2142 2143The following provides an example using pseudo LLVM MIR. 2144 2145.. code:: 2146 :number-lines: 2147 2148 $lex_start: 2149 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[ 2150 DW_AT_name = "__uint64"; 2151 DW_AT_byte_size = 8; 2152 DW_AT_encoding = DW_ATE_unsigned; 2153 ]; 2154 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[ 2155 DW_AT_name = "__active_lane_pc"; 2156 DW_AT_location = [ 2157 DW_OP_regx PC; 2158 DW_OP_LLVM_extend 64, 64; 2159 DW_OP_regval_type EXEC, %uint_64; 2160 DW_OP_LLVM_select_bit_piece 64, 64; 2161 ]; 2162 ]; 2163 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[ 2164 DW_AT_name = "__divergent_lane_pc"; 2165 DW_AT_location = [ 2166 DW_OP_LLVM_undefined; 2167 DW_OP_LLVM_extend 64, 64; 2168 ]; 2169 ]; 2170 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2171 DW_OP_call_ref %__divergent_lane_pc; 2172 DW_OP_call_ref %__active_lane_pc; 2173 ]; 2174 a; 2175 %1 = EXEC; 2176 DBG_VALUE %1, $noreg, %__lex_1_save_exec; 2177 %2 = c1; 2178 $lex_1_start: 2179 EXEC = %1 & %2; 2180 $lex_1_then: 2181 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[ 2182 DW_AT_name = "__divergent_lane_pc_1_then"; 2183 DW_AT_location = DIExpression[ 2184 DW_OP_call_ref %__divergent_lane_pc; 2185 DW_OP_addrx &lex_1_start; 2186 
DW_OP_stack_value; 2187 DW_OP_LLVM_extend 64, 64; 2188 DW_OP_call_ref %__lex_1_save_exec; 2189 DW_OP_deref_type 64, %__uint_64; 2190 DW_OP_LLVM_select_bit_piece 64, 64; 2191 ]; 2192 ]; 2193 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2194 DW_OP_call_ref %__divergent_lane_pc_1_then; 2195 DW_OP_call_ref %__active_lane_pc; 2196 ]; 2197 b; 2198 %3 = EXEC; 2199 DBG_VALUE %3, %__lex_1_1_save_exec; 2200 %4 = c2; 2201 $lex_1_1_start: 2202 EXEC = %3 & %4; 2203 $lex_1_1_then: 2204 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[ 2205 DW_AT_name = "__divergent_lane_pc_1_1_then"; 2206 DW_AT_location = DIExpression[ 2207 DW_OP_call_ref %__divergent_lane_pc_1_then; 2208 DW_OP_addrx &lex_1_1_start; 2209 DW_OP_stack_value; 2210 DW_OP_LLVM_extend 64, 64; 2211 DW_OP_call_ref %__lex_1_1_save_exec; 2212 DW_OP_deref_type 64, %__uint_64; 2213 DW_OP_LLVM_select_bit_piece 64, 64; 2214 ]; 2215 ]; 2216 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2217 DW_OP_call_ref %__divergent_lane_pc_1_1_then; 2218 DW_OP_call_ref %__active_lane_pc; 2219 ]; 2220 c; 2221 EXEC = ~EXEC & %3; 2222 $lex_1_1_else: 2223 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[ 2224 DW_AT_name = "__divergent_lane_pc_1_1_else"; 2225 DW_AT_location = DIExpression[ 2226 DW_OP_call_ref %__divergent_lane_pc_1_then; 2227 DW_OP_addrx &lex_1_1_end; 2228 DW_OP_stack_value; 2229 DW_OP_LLVM_extend 64, 64; 2230 DW_OP_call_ref %__lex_1_1_save_exec; 2231 DW_OP_deref_type 64, %__uint_64; 2232 DW_OP_LLVM_select_bit_piece 64, 64; 2233 ]; 2234 ]; 2235 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2236 DW_OP_call_ref %__divergent_lane_pc_1_1_else; 2237 DW_OP_call_ref %__active_lane_pc; 2238 ]; 2239 d; 2240 EXEC = %3; 2241 $lex_1_1_end: 2242 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2243 DW_OP_call_ref %__divergent_lane_pc; 2244 DW_OP_call_ref %__active_lane_pc; 2245 ]; 2246 e; 2247 EXEC = ~EXEC & %1; 2248 $lex_1_else: 2249 DEFINE_DWARF 
%__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
         DW_AT_name = "__divergent_lane_pc_1_else";
         DW_AT_location = DIExpression[
           DW_OP_call_ref %__divergent_lane_pc;
           DW_OP_addrx &lex_1_end;
           DW_OP_stack_value;
           DW_OP_LLVM_extend 64, 64;
           DW_OP_call_ref %__lex_1_save_exec;
           DW_OP_deref_type 64, %__uint_64;
           DW_OP_LLVM_select_bit_piece 64, 64;
         ];
       ];
       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
         DW_OP_call_ref %__divergent_lane_pc_1_else;
         DW_OP_call_ref %__active_lane_pc;
       ];
       f;
       EXEC = %1;
     $lex_1_end:
       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
         DW_OP_call_ref %__divergent_lane_pc;
         DW_OP_call_ref %__active_lane_pc;
       ];
       g;
     $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
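The effect of ``DW_OP_LLVM_select_bit_piece`` in these expressions can be
modeled with a short sketch (hypothetical Python, not a DWARF implementation):
each ``EXEC`` bit selects, per lane, between the divergent lane PC and the
current program location.

```python
# Hypothetical model, not a DWARF implementation: bit i of exec_mask
# selects the current PC for lane i; otherwise the lane keeps its
# divergent (or undefined, modeled here as None) program location.

def select_lane_pcs(divergent_pcs, exec_mask, current_pc):
    return [current_pc if (exec_mask >> lane) & 1 else pc
            for lane, pc in enumerate(divergent_pcs)]

# Four-lane example: lanes 1 and 2 are active, lane 0 is divergent at
# 0x100, and lane 3 was never active (undefined location).
pcs = select_lane_pcs([0x100, 0x100, 0x100, None], 0b0110, 0x200)
print(pcs)  # [256, 512, 512, None]
```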
2294 2295An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of 2296``IF/THEN/ELSE`` regions. 2297 2298The DWARF procedures can use the active lane artificial variable described in 2299:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual 2300``EXEC`` mask in order to support whole or quad wavefront mode. 2301 2302.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane: 2303 2304``DW_AT_LLVM_active_lane`` 2305~~~~~~~~~~~~~~~~~~~~~~~~~~ 2306 2307The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information 2308entry is used to specify the lanes that are conceptually active for a SIMT 2309thread. 2310 2311The execution mask may be modified to implement whole or quad wavefront mode 2312operations. For example, all lanes may need to temporarily be made active to 2313execute a whole wavefront operation. Such regions would save the ``EXEC`` mask, 2314update it to enable the necessary lanes, perform the operations, and then 2315restore the ``EXEC`` mask from the saved value. While executing the whole 2316wavefront region, the conceptual execution mask is the saved value, not the 2317``EXEC`` value. 2318 2319This is handled by defining an artificial variable for the active lane mask. The 2320active lane mask artificial variable would be the actual ``EXEC`` mask for 2321normal regions, and the saved execution mask for regions where the mask is 2322temporarily updated. The location list expression created for this artificial 2323variable is used to define the value of the ``DW_AT_LLVM_active_lane`` 2324attribute. 2325 2326``DW_AT_LLVM_augmentation`` 2327~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2328 2329For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit 2330debugger information entry has the following value for the augmentation string: 2331 2332:: 2333 2334 [amdgpu:v0.0] 2335 2336The "vX.Y" specifies the major X and minor Y version number of the AMDGPU 2337extensions used in the DWARF of the compilation unit. 
The version number 2338conforms to [SEMVER]_. 2339 2340Call Frame Information 2341---------------------- 2342 2343DWARF Call Frame Information (CFI) describes how a consumer can virtually 2344*unwind* call frames in a running process or core dump. See DWARF Version 5 2345section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`. 2346 2347For AMDGPU, the Common Information Entry (CIE) fields have the following values: 2348 23491. ``augmentation`` string contains the following null-terminated UTF-8 string: 2350 2351 :: 2352 2353 [amd:v0.0] 2354 2355 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU 2356 extensions used in this CIE or to the FDEs that use it. The version number 2357 conforms to [SEMVER]_. 2358 23592. ``address_size`` for the ``Global`` address space is defined in 2360 :ref:`amdgpu-dwarf-address-space-identifier`. 2361 23623. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector. 2363 23644. ``code_alignment_factor`` is 4 bytes. 2365 2366 .. TODO:: 2367 2368 Add to :ref:`amdgpu-processor-table` table. 2369 23705. ``data_alignment_factor`` is 4 bytes. 2371 2372 .. TODO:: 2373 2374 Add to :ref:`amdgpu-processor-table` table. 2375 23766. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64`` 2377 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`. 2378 23797. ``initial_instructions`` Since a subprogram X with fewer registers can be 2380 called from subprogram Y that has more allocated, X will not change any of 2381 the extra registers as it cannot access them. Therefore, the default rule 2382 for all columns is ``same value``. 2383 2384For AMDGPU the register number follows the numbering defined in 2385:ref:`amdgpu-dwarf-register-identifier`. 2386 2387For AMDGPU the instructions are variable size. A consumer can subtract 1 from 2388the return address to get the address of a byte within the call site 2389instructions. See DWARF Version 5 section 6.4.4. 

Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table fields have the following
values:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This differs from the DWARF Version 5 definition, which requires the first
    4 characters to be the vendor ID. However, it is consistent with the other
    augmentation strings and does allow multiple vendor contributions, although
    backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table fields have the following
values:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector, so this is 0. The entries in
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

   Should the ``isa`` state machine register be used to indicate if the code is
   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector, so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1, which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either the 32-bit or the 64-bit DWARF format. LLVM
  generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
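Regardless of version, the metadata travels inside standard ELF note records,
whose framing can be walked generically. The sketch below builds and parses a
synthetic note; it is a hedged illustration only, and the note name and type
value used here (``AMDGPU``, 32) are assumptions for demonstration rather than
normative values, which are defined by the note records sections referenced
below:

```python
import struct

def parse_elf_notes(blob: bytes, align: int = 4):
    """Walk a chain of ELF note records: each record is three u32 words
    (namesz, descsz, type) followed by the name and the descriptor,
    each padded to the given alignment."""
    pad = lambda n: (n + align - 1) & ~(align - 1)
    off, notes = 0, []
    while off < len(blob):
        namesz, descsz, ntype = struct.unpack_from("<III", blob, off)
        off += 12
        name = blob[off:off + namesz].rstrip(b"\0").decode()
        off += pad(namesz)
        desc = blob[off:off + descsz]
        off += pad(descsz)
        notes.append((name, ntype, desc))
    return notes

# Build a synthetic metadata-style note. The descriptor is a one-entry
# MessagePack map, {"example": nil}, standing in for real metadata.
name, desc = b"AMDGPU\0", b"\x81\xa7example\xc0"
note = struct.pack("<III", len(name), len(desc), 32)
note += name + b"\0" * (-len(name) % 4)
note += desc + b"\0" * (-len(desc) % 4)
print(parse_elf_notes(note))  # [('AMDGPU', 32, b'\x81\xa7example\xc0')]
```

A real consumer would locate the note section in the code object first and then
decode the descriptor as YAML or MessagePack depending on the code object
version described below.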

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries; for
example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V2 is not the default code object version emitted by this version
  of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

   Is the string null terminated? It probably should not be if YAML allows it
   to contain null characters; otherwise it should be.

The metadata is represented as a single YAML document consisting of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of printf function call
                                           minus 1

                                         ``S[i]`` (where i = 0, 1, ... , N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         FormatString
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the
                                                mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.

     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of two.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to a S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in kernarg.

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in kernarg.

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  to help link enqueued kernels
                                                  into the ancestor tree for
                                                  determining when the parent
                                                  kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer
                                                  for multi-grid synchronization
                                                  is passed in the kernarg.

     "ValueType"       string                   Unused and deprecated. This
                                                should no longer be emitted, but
                                                is accepted for compatibility.

     "PointeeAlign"    integer                  Alignment in bytes of pointee
                                                type for pointer type kernel
                                                argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::

                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?

     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::

                                                   Does this apply to
                                                   GlobalBuffer?

     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".

     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::

                                                   Can GlobalBuffer be pipe
                                                   qualified?

     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX10. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPRs added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX10.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
2981 "NumSpilledVGPRs" integer Number of stores from 2982 a vector register to 2983 a register allocator 2984 created spill 2985 location. 2986 ============================ ============== ========= ===================== 2987 2988.. _amdgpu-amdhsa-code-object-metadata-v3: 2989 2990Code Object V3 Metadata 2991+++++++++++++++++++++++ 2992 2993Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note 2994record (see :ref:`amdgpu-note-records-v3-v4`). 2995 2996The metadata is represented as Message Pack formatted binary data (see 2997[MsgPack]_). The top level is a Message Pack map that includes the 2998keys defined in table 2999:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced 3000tables. 3001 3002Additional information can be added to the maps. To avoid conflicts, 3003any key names should be prefixed by "*vendor-name*." where 3004``vendor-name`` can be the name of the vendor and specific vendor 3005tool that generates the information. The prefix is abbreviated to 3006simply "." when it appears within a map that has been added by the 3007same *vendor-name*. 3008 3009 .. table:: AMDHSA Code Object V3 Metadata Map 3010 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 3011 3012 ================= ============== ========= ======================================= 3013 String Key Value Type Required? Description 3014 ================= ============== ========= ======================================= 3015 "amdhsa.version" sequence of Required - The first integer is the major 3016 2 integers version. Currently 1. 3017 - The second integer is the minor 3018 version. Currently 0. 3019 "amdhsa.printf" sequence of Each string is encoded information 3020 strings about a printf function call. 
The 3021 encoded information is organized as 3022 fields separated by colon (':'): 3023 3024 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 3025 3026 where: 3027 3028 ``ID`` 3029 A 32-bit integer as a unique id for 3030 each printf function call 3031 3032 ``N`` 3033 A 32-bit integer equal to the number 3034 of arguments of printf function call 3035 minus 1 3036 3037 ``S[i]`` (where i = 0, 1, ... , N-1) 3038 32-bit integers for the size in bytes 3039 of the i-th FormatString argument of 3040 the printf function call 3041 3042 FormatString 3043 The format string passed to the 3044 printf function call. 3045 "amdhsa.kernels" sequence of Required Sequence of the maps for each 3046 map kernel in the code object. See 3047 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` 3048 for the definition of the keys included 3049 in that map. 3050 ================= ============== ========= ======================================= 3051 3052.. 3053 3054 .. table:: AMDHSA Code Object V3 Kernel Metadata Map 3055 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 3056 3057 =================================== ============== ========= ================================ 3058 String Key Value Type Required? Description 3059 =================================== ============== ========= ================================ 3060 ".name" string Required Source name of the kernel. 3061 ".symbol" string Required Name of the kernel 3062 descriptor ELF symbol. 3063 ".language" string Source language of the kernel. 3064 Values include: 3065 3066 - "OpenCL C" 3067 - "OpenCL C++" 3068 - "HCC" 3069 - "HIP" 3070 - "OpenMP" 3071 - "Assembler" 3072 3073 ".language_version" sequence of - The first integer is the major 3074 2 integers version. 3075 - The second integer is the 3076 minor version. 3077 ".args" sequence of Sequence of maps of the 3078 map kernel arguments. 
See 3079 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` 3080 for the definition of the keys 3081 included in that map. 3082 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values 3083 3 integers must be >=1 and the dispatch 3084 work-group size X, Y, Z must 3085 correspond to the specified 3086 values. Defaults to 0, 0, 0. 3087 3088 Corresponds to the OpenCL 3089 ``reqd_work_group_size`` 3090 attribute. 3091 ".workgroup_size_hint" sequence of The dispatch work-group size 3092 3 integers X, Y, Z is likely to be the 3093 specified values. 3094 3095 Corresponds to the OpenCL 3096 ``work_group_size_hint`` 3097 attribute. 3098 ".vec_type_hint" string The name of a scalar or vector 3099 type. 3100 3101 Corresponds to the OpenCL 3102 ``vec_type_hint`` attribute. 3103 3104 ".device_enqueue_symbol" string The external symbol name 3105 associated with a kernel. 3106 OpenCL runtime allocates a 3107 global buffer for the symbol 3108 and saves the kernel's address 3109 to it, which is used for 3110 device side enqueueing. Only 3111 available for device side 3112 enqueued kernels. 3113 ".kernarg_segment_size" integer Required The size in bytes of 3114 the kernarg segment 3115 that holds the values 3116 of the arguments to 3117 the kernel. 3118 ".group_segment_fixed_size" integer Required The amount of group 3119 segment memory 3120 required by a 3121 work-group in 3122 bytes. This does not 3123 include any 3124 dynamically allocated 3125 group segment memory 3126 that may be added 3127 when the kernel is 3128 dispatched. 3129 ".private_segment_fixed_size" integer Required The amount of fixed 3130 private address space 3131 memory required for a 3132 work-item in 3133 bytes. If the kernel 3134 uses a dynamic call 3135 stack then additional 3136 space must be added 3137 to this value for the 3138 call stack. 3139 ".kernarg_segment_align" integer Required The maximum byte 3140 alignment of 3141 arguments in the 3142 kernarg segment. 
Must 3143 be a power of 2. 3144 ".wavefront_size" integer Required Wavefront size. Must 3145 be a power of 2. 3146 ".sgpr_count" integer Required Number of scalar 3147 registers required by a 3148 wavefront for 3149 GFX6-GFX9. A register 3150 is required if it is 3151 used explicitly, or 3152 if a higher numbered 3153 register is used 3154 explicitly. This 3155 includes the special 3156 SGPRs for VCC, Flat 3157 Scratch (GFX7-GFX9) 3158 and XNACK (for 3159 GFX8-GFX9). It does 3160 not include the 16 3161 SGPR added if a trap 3162 handler is 3163 enabled. It is not 3164 rounded up to the 3165 allocation 3166 granularity. 3167 ".vgpr_count" integer Required Number of vector 3168 registers required by 3169 each work-item for 3170 GFX6-GFX9. A register 3171 is required if it is 3172 used explicitly, or 3173 if a higher numbered 3174 register is used 3175 explicitly. 3176 ".max_flat_workgroup_size" integer Required Maximum flat 3177 work-group size 3178 supported by the 3179 kernel in work-items. 3180 Must be >=1 and 3181 consistent with 3182 ReqdWorkGroupSize if 3183 not 0, 0, 0. 3184 ".sgpr_spill_count" integer Number of stores from 3185 a scalar register to 3186 a register allocator 3187 created spill 3188 location. 3189 ".vgpr_spill_count" integer Number of stores from 3190 a vector register to 3191 a register allocator 3192 created spill 3193 location. 3194 ".kind" string The kind of the kernel 3195 with the following 3196 values: 3197 3198 "normal" 3199 Regular kernels. 3200 3201 "init" 3202 These kernels must be 3203 invoked after loading 3204 the containing code 3205 object and must 3206 complete before any 3207 normal and fini 3208 kernels in the same 3209 code object are 3210 invoked. 3211 3212 "fini" 3213 These kernels must be 3214 invoked before 3215 unloading the 3216 containing code object 3217 and after all init and 3218 normal kernels in the 3219 same code object have 3220 been invoked and 3221 completed. 3222 3223 If omitted, "normal" is 3224 assumed. 
3225 =================================== ============== ========= ================================ 3226 3227.. 3228 3229 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map 3230 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 3231 3232 ====================== ============== ========= ================================ 3233 String Key Value Type Required? Description 3234 ====================== ============== ========= ================================ 3235 ".name" string Kernel argument name. 3236 ".type_name" string Kernel argument type name. 3237 ".size" integer Required Kernel argument size in bytes. 3238 ".offset" integer Required Kernel argument offset in 3239 bytes. The offset must be a 3240 multiple of the alignment 3241 required by the argument. 3242 ".value_kind" string Required Kernel argument kind that 3243 specifies how to set up the 3244 corresponding argument. 3245 Values include: 3246 3247 "by_value" 3248 The argument is copied 3249 directly into the kernarg. 3250 3251 "global_buffer" 3252 A global address space pointer 3253 to the buffer data is passed 3254 in the kernarg. 3255 3256 "dynamic_shared_pointer" 3257 A group address space pointer 3258 to dynamically allocated LDS 3259 is passed in the kernarg. 3260 3261 "sampler" 3262 A global address space 3263 pointer to a S# is passed in 3264 the kernarg. 3265 3266 "image" 3267 A global address space 3268 pointer to a T# is passed in 3269 the kernarg. 3270 3271 "pipe" 3272 A global address space pointer 3273 to an OpenCL pipe is passed in 3274 the kernarg. 3275 3276 "queue" 3277 A global address space pointer 3278 to an OpenCL device enqueue 3279 queue is passed in the 3280 kernarg. 3281 3282 "hidden_global_offset_x" 3283 The OpenCL grid dispatch 3284 global offset for the X 3285 dimension is passed in the 3286 kernarg. 3287 3288 "hidden_global_offset_y" 3289 The OpenCL grid dispatch 3290 global offset for the Y 3291 dimension is passed in the 3292 kernarg. 
3293 3294 "hidden_global_offset_z" 3295 The OpenCL grid dispatch 3296 global offset for the Z 3297 dimension is passed in the 3298 kernarg. 3299 3300 "hidden_none" 3301 An argument that is not used 3302 by the kernel. Space needs to 3303 be left for it, but it does 3304 not need to be set up. 3305 3306 "hidden_printf_buffer" 3307 A global address space pointer 3308 to the runtime printf buffer 3309 is passed in kernarg. 3310 3311 "hidden_hostcall_buffer" 3312 A global address space pointer 3313 to the runtime hostcall buffer 3314 is passed in kernarg. 3315 3316 "hidden_default_queue" 3317 A global address space pointer 3318 to the OpenCL device enqueue 3319 queue that should be used by 3320 the kernel by default is 3321 passed in the kernarg. 3322 3323 "hidden_completion_action" 3324 A global address space pointer 3325 to help link enqueued kernels into 3326 the ancestor tree for determining 3327 when the parent kernel has finished. 3328 3329 "hidden_multigrid_sync_arg" 3330 A global address space pointer for 3331 multi-grid synchronization is 3332 passed in the kernarg. 3333 3334 ".value_type" string Unused and deprecated. This should no longer 3335 be emitted, but is accepted for compatibility. 3336 3337 ".pointee_align" integer Alignment in bytes of pointee 3338 type for pointer type kernel 3339 argument. Must be a power 3340 of 2. Only present if 3341 ".value_kind" is 3342 "dynamic_shared_pointer". 3343 ".address_space" string Kernel argument address space 3344 qualifier. Only present if 3345 ".value_kind" is "global_buffer" or 3346 "dynamic_shared_pointer". Values 3347 are: 3348 3349 - "private" 3350 - "global" 3351 - "constant" 3352 - "local" 3353 - "generic" 3354 - "region" 3355 3356 .. TODO:: 3357 3358 Is "global_buffer" only "global" 3359 or "constant"? Is 3360 "dynamic_shared_pointer" always 3361 "local"? Can HCC allow "generic"? 3362 How can "private" or "region" 3363 ever happen? 3364 3365 ".access" string Kernel argument access 3366 qualifier. 
Only present if 3367 ".value_kind" is "image" or 3368 "pipe". Values 3369 are: 3370 3371 - "read_only" 3372 - "write_only" 3373 - "read_write" 3374 3375 .. TODO:: 3376 3377 Does this apply to 3378 "global_buffer"? 3379 3380 ".actual_access" string The actual memory accesses 3381 performed by the kernel on the 3382 kernel argument. Only present if 3383 ".value_kind" is "global_buffer", 3384 "image", or "pipe". This may be 3385 more restrictive than indicated 3386 by ".access" to reflect what the 3387 kernel actual does. If not 3388 present then the runtime must 3389 assume what is implied by 3390 ".access" and ".is_const" . Values 3391 are: 3392 3393 - "read_only" 3394 - "write_only" 3395 - "read_write" 3396 3397 ".is_const" boolean Indicates if the kernel argument 3398 is const qualified. Only present 3399 if ".value_kind" is 3400 "global_buffer". 3401 3402 ".is_restrict" boolean Indicates if the kernel argument 3403 is restrict qualified. Only 3404 present if ".value_kind" is 3405 "global_buffer". 3406 3407 ".is_volatile" boolean Indicates if the kernel argument 3408 is volatile qualified. Only 3409 present if ".value_kind" is 3410 "global_buffer". 3411 3412 ".is_pipe" boolean Indicates if the kernel argument 3413 is pipe qualified. Only present 3414 if ".value_kind" is "pipe". 3415 3416 .. TODO:: 3417 3418 Can "global_buffer" be pipe 3419 qualified? 3420 3421 ====================== ============== ========= ================================ 3422 3423.. _amdgpu-amdhsa-code-object-metadata-v4: 3424 3425Code Object V4 Metadata 3426+++++++++++++++++++++++ 3427 3428.. warning:: 3429 Code object V4 is not the default code object version emitted by this version 3430 of LLVM. 3431 3432Code object V4 metadata is the same as 3433:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions 3434defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3`. 3435 3436 .. 
table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3` 3437 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4 3438 3439 ================= ============== ========= ======================================= 3440 String Key Value Type Required? Description 3441 ================= ============== ========= ======================================= 3442 "amdhsa.version" sequence of Required - The first integer is the major 3443 2 integers version. Currently 1. 3444 - The second integer is the minor 3445 version. Currently 1. 3446 "amdhsa.target" string Required The target name of the code using the syntax: 3447 3448 .. code:: 3449 3450 <target-triple> [ "-" <target-id> ] 3451 3452 A canonical target ID must be 3453 used. See :ref:`amdgpu-target-triples` 3454 and :ref:`amdgpu-target-id`. 3455 ================= ============== ========= ======================================= 3456 3457.. 3458 3459Kernel Dispatch 3460~~~~~~~~~~~~~~~ 3461 3462The HSA architected queuing language (AQL) defines a user space memory interface 3463that can be used to control the dispatch of kernels, in an agent independent 3464way. An agent can have zero or more AQL queues created for it using an HSA 3465compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which 3466are 64 bytes) can be placed. See the *HSA Platform System Architecture 3467Specification* [HSA]_ for the AQL queue mechanics and packet layouts. 3468 3469The packet processor of a kernel agent is responsible for detecting and 3470dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the 3471packet processor is implemented by the hardware command processor (CP), 3472asynchronous dispatch controller (ADC) and shader processor input controller 3473(SPI). 3474 3475An HSA compatible runtime can be used to allocate an AQL queue object. It uses 3476the kernel mode driver to initialize and register the AQL queue with CP. 
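Every AQL packet, including the kernel dispatch packet, is 64 bytes. As a
rough illustration (not taken from the LLVM sources), the kernel dispatch
packet's field layout from the *HSA Platform System Architecture
Specification* [HSA]_ can be written down with Python's ``struct`` module to
check that the field sizes add up:

```python
import struct

# Sketch of the HSA kernel dispatch packet layout (little-endian, packed):
#   header, setup, workgroup_size_x/y/z, reserved0      -> six 16-bit fields
#   grid_size_x/y/z, private_segment_size,
#   group_segment_size                                  -> five 32-bit fields
#   kernel_object, kernarg_address, reserved2,
#   completion_signal                                   -> four 64-bit fields
AQL_DISPATCH_PACKET = struct.Struct("<6H5I4Q")

# All AQL packets are exactly 64 bytes.
assert AQL_DISPATCH_PACKET.size == 64
```

The authoritative definition is ``hsa_kernel_dispatch_packet_t`` in the HSA
runtime headers; this sketch only mirrors its size and field order.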

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was
   loaded by an HSA compatible runtime on the kernel agent with which the AQL
   queue is associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6.
A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values,
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  ..
table:: AMDHSA Memory Spaces 3532 :name: amdgpu-amdhsa-memory-spaces-table 3533 3534 ================= =========== ======== ======= ================== 3535 Memory Space Name HSA Segment Hardware Address NULL Value 3536 Name Name Size 3537 ================= =========== ======== ======= ================== 3538 Private private scratch 32 0x00000000 3539 Local group LDS 32 0xFFFFFFFF 3540 Global global global 64 0x0000000000000000 3541 Constant constant *same as 64 0x0000000000000000 3542 global* 3543 Generic flat flat 64 0x0000000000000000 3544 Region N/A GDS 32 *not implemented 3545 for AMDHSA* 3546 ================= =========== ======== ======= ================== 3547 3548The global and constant memory spaces both use global virtual addresses, which 3549are the same virtual address space used by the CPU. However, some virtual 3550addresses may only be accessible to the CPU, some only accessible by the GPU, 3551and some by both. 3552 3553Using the constant memory space indicates that the data will not change during 3554the execution of the kernel. This allows scalar read instructions to be 3555used. The vector and scalar L1 caches are invalidated of volatile data before 3556each kernel dispatch execution to allow constant memory to change values between 3557kernel dispatches. 3558 3559The local memory space uses the hardware Local Data Store (LDS) which is 3560automatically allocated when the hardware creates work-groups of wavefronts, and 3561freed when all the wavefronts of a work-group have terminated. The data store 3562(DS) instructions can be used to access it. 3563 3564The private memory space uses the hardware scratch memory support. If the kernel 3565uses scratch, then the hardware allocates memory that is accessed using 3566wavefront lane dword (4 byte) interleaving. 
The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.

The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with the
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
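The scratch interleaving and the aperture-based address conversion can be made
concrete with a small sketch. This is an illustration only, assuming the
2^32-byte, 2^32-aligned apertures used for 64-bit flat addressing; the function
names and the aperture base constant are invented for the example and are not
part of any API:

```python
def private_to_physical(wavefront_scratch_base: int, private_address: int,
                        wavefront_lane_id: int, wavefront_size: int = 64) -> int:
    """Scratch interleaving: wavefront-scratch-base +
    (private-address * wavefront-size * 4) + (wavefront-lane-id * 4)."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)

# Hypothetical 2^32-aligned LDS aperture base, for illustration only.
LDS_APERTURE_BASE = 0x0000_4000_0000_0000

def segment_to_flat(segment_address: int, aperture_base: int) -> int:
    # A 32-bit segment address maps into the aperture's 2^32-byte range.
    return aperture_base + (segment_address & 0xFFFFFFFF)

def flat_to_segment(flat_address: int, aperture_base: int) -> int:
    # With a 2^32-aligned aperture base, the low 32 bits of a flat address
    # inside the aperture are exactly the segment address.
    assert aperture_base <= flat_address < aperture_base + (1 << 32)
    return flat_address & 0xFFFFFFFF
```

Note how lanes of a wavefront that use the same ``private_address`` land on
adjacent dwords, which is what makes uniform private accesses cache friendly.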
For 3599GFX9-GFX10 the aperture base addresses are directly available as inline constant 3600registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit 3601address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32 3602which makes it easier to convert from flat to segment or segment to flat. 3603 3604Image and Samplers 3605~~~~~~~~~~~~~~~~~~ 3606 3607Image and sample handles created by an HSA compatible runtime (see 3608:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48 byte S# 3609object respectively. In order to support the HSA ``query_sampler`` operations 3610two extra dwords are used to store the HSA BRIG enumeration values for the 3611queries that are not trivially deducible from the S# representation. 3612 3613HSA Signals 3614~~~~~~~~~~~ 3615 3616HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`) 3617are 64-bit addresses of a structure allocated in memory accessible from both the 3618CPU and GPU. The structure is defined by the runtime and subject to change 3619between releases. For example, see [AMD-ROCm-github]_. 3620 3621.. _amdgpu-amdhsa-hsa-aql-queue: 3622 3623HSA AQL Queue 3624~~~~~~~~~~~~~ 3625 3626The HSA AQL queue structure is defined by an HSA compatible runtime (see 3627:ref:`amdgpu-os`) and subject to change between releases. For example, see 3628[AMD-ROCm-github]_. For some processors it contains fields needed to implement 3629certain language features such as the flat address aperture bases. It also 3630contains fields used by CP such as managing the allocation of scratch memory. 3631 3632.. _amdgpu-amdhsa-kernel-descriptor: 3633 3634Kernel Descriptor 3635~~~~~~~~~~~~~~~~~ 3636 3637A kernel descriptor consists of the information needed by CP to initiate the 3638execution of a kernel, including the entry point address of the machine code 3639that implements the kernel. 

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the kernel descriptor to be allocated with 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need
                                                     to be added to this value
                                                     if the call stack has
                                                     non-inlined function calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       NULL then there are no
                                                       kernel arguments.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is 0 then the kernarg
                                                       memory size is
                                                       unspecified.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is not 0 then the value
                                                       specifies the kernarg
                                                       memory size in bytes. It
                                                       is recommended to provide
                                                       a value as it may be used
                                                       by CP to optimize making
                                                       the kernarg memory
                                                       visible to the kernel
                                                       code.

     127:96  4 bytes                                 Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20                                      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX90A
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the
                                                     value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
                     _BUFFER                         column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
                                                     column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _SIZE
     457:455 3 bits                                  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits                                  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
     511:472 5 bytes                                 Reserved, must be 0.
     512     **Total size 64 bytes.**
     ======= ====================================================================

..

  ..
table:: compute_pgm_rsrc1 for GFX6-GFX10 3796 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table 3797 3798 ======= ======= =============================== =========================================================================== 3799 Bits Size Field Name Description 3800 ======= ======= =============================== =========================================================================== 3801 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register 3802 blocks used by each work-item; 3803 granularity is device 3804 specific: 3805 3806 GFX6-GFX9 3807 - vgprs_used 0..256 3808 - max(0, ceil(vgprs_used / 4) - 1) 3809 GFX90A 3810 - vgprs_used 0..512 3811 - vgprs_used = align(arch_vgprs, 4) 3812 + acc_vgprs 3813 - max(0, ceil(vgprs_used / 8) - 1) 3814 GFX10 (wavefront size 64) 3815 - max_vgpr 1..256 3816 - max(0, ceil(vgprs_used / 4) - 1) 3817 GFX10 (wavefront size 32) 3818 - max_vgpr 1..256 3819 - max(0, ceil(vgprs_used / 8) - 1) 3820 3821 Where vgprs_used is defined 3822 as the highest VGPR number 3823 explicitly referenced plus 3824 one. 3825 3826 Used by CP to set up 3827 ``COMPUTE_PGM_RSRC1.VGPRS``. 3828 3829 The 3830 :ref:`amdgpu-assembler` 3831 calculates this 3832 automatically for the 3833 selected processor from 3834 values provided to the 3835 `.amdhsa_kernel` directive 3836 by the 3837 `.amdhsa_next_free_vgpr` 3838 nested directive (see 3839 :ref:`amdhsa-kernel-directives-table`). 3840 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register 3841 blocks used by a wavefront; 3842 granularity is device 3843 specific: 3844 3845 GFX6-GFX8 3846 - sgprs_used 0..112 3847 - max(0, ceil(sgprs_used / 8) - 1) 3848 GFX9 3849 - sgprs_used 0..112 3850 - 2 * max(0, ceil(sgprs_used / 16) - 1) 3851 GFX10 3852 Reserved, must be 0. 3853 (128 SGPRs always 3854 allocated.) 

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with the specified rounding
                                                     mode for single precision
                                                     (32-bit) floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with the specified rounding
                                                     mode for half/double
                                                     precision (16 and 64-bit)
                                                     floating point operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with the specified denorm
                                                     mode for single precision
                                                     (32-bit) floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with the specified denorm
                                                     mode for half/double
                                                     precision (16 and 64-bit)
                                                     floating point operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10-style
                                                     treatment of NaNs (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts execution
                                                       with the specified fp16
                                                       overflow mode.

                                                       - If 0, fp16 overflow
                                                         generates +/-INF values.
                                                       - If 1, fp16 overflow that
                                                         is the result of a
                                                         +/-INF input value
                                                         or divide by 0 produces
                                                         a +/-INF, otherwise
                                                         clamps computed
                                                         overflow to +/-MAX_FP16
                                                         as appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute work-groups
                                                         in CU wavefront
                                                         execution mode.
                                                       - If 1 execute work-groups
                                                         in WGP wavefront
                                                         execution mode.

                                                     See :ref:`amdgpu-amdhsa-memory-model`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior of the
                                                       s_waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4070 31 1 bit FWD_PROGRESS GFX6-GFX9 4071 Reserved, must be 0. 4072 GFX10 4073 - If 0 execute SIMD wavefronts 4074 using oldest first policy. 4075 - If 1 execute SIMD wavefronts to 4076 ensure wavefronts will make some 4077 forward progress. 4078 4079 Used by CP to set up 4080 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. 4081 32 **Total size 4 bytes** 4082 ======= =================================================================================================================== 4083 4084.. 4085 4086 .. table:: compute_pgm_rsrc2 for GFX6-GFX10 4087 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table 4088 4089 ======= ======= =============================== =========================================================================== 4090 Bits Size Field Name Description 4091 ======= ======= =============================== =========================================================================== 4092 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the 4093 private segment. 4094 * If the *Target Properties* 4095 column of 4096 :ref:`amdgpu-processor-table` 4097 does not specify 4098 *Architected flat 4099 scratch* then enable the 4100 setup of the SGPR 4101 wavefront scratch offset 4102 system register (see 4103 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4104 * If the *Target Properties* 4105 column of 4106 :ref:`amdgpu-processor-table` 4107 specifies *Architected 4108 flat scratch* then enable 4109 the setup of the 4110 FLAT_SCRATCH register 4111 pair (see 4112 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4113 4114 Used by CP to set up 4115 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. 4116 5:1 5 bits USER_SGPR_COUNT The total number of SGPR 4117 user data registers 4118 requested. This number must 4119 match the number of user 4120 data registers enabled. 4121 4122 Used by CP to set up 4123 ``COMPUTE_PGM_RSRC2.USER_SGPR``. 4124 6 1 bit ENABLE_TRAP_HANDLER Must be 0. 
4125 4126 This bit represents 4127 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, 4128 which is set by the CP if 4129 the runtime has installed a 4130 trap handler. 4131 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the 4132 system SGPR register for 4133 the work-group id in the X 4134 dimension (see 4135 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4136 4137 Used by CP to set up 4138 ``COMPUTE_PGM_RSRC2.TGID_X_EN``. 4139 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the 4140 system SGPR register for 4141 the work-group id in the Y 4142 dimension (see 4143 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4144 4145 Used by CP to set up 4146 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. 4147 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the 4148 system SGPR register for 4149 the work-group id in the Z 4150 dimension (see 4151 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4152 4153 Used by CP to set up 4154 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. 4155 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the 4156 system SGPR register for 4157 work-group information (see 4158 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4159 4160 Used by CP to set up 4161 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. 4162 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the 4163 VGPR system registers used 4164 for the work-item ID. 4165 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` 4166 defines the values. 4167 4168 Used by CP to set up 4169 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. 4170 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0. 4171 4172 Wavefront starts execution 4173 with address watch 4174 exceptions enabled which 4175 are generated when L1 has 4176 witnessed a thread access 4177 an *address of 4178 interest*. 4179 4180 CP is responsible for 4181 filling in the address 4182 watch bit in 4183 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 4184 according to what the 4185 runtime requests. 4186 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0. 

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions enabled which
                                                     are generated when a
                                                     memory violation has
                                                     occurred for this wavefront
                                                     from L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX10
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with the specified
                                                     exceptions enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX90A
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10                                      Reserved, must be 0.
             bits
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15                                      Reserved, must be 0.
             bits
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
     31:4    28                                      Reserved, must be 0.
4288 bits 4289 32 **Total size 4 bytes.** 4290 ======= =================================================================================================================== 4291 4292.. 4293 4294 .. table:: Floating Point Rounding Mode Enumeration Values 4295 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table 4296 4297 ====================================== ===== ============================== 4298 Enumeration Name Value Description 4299 ====================================== ===== ============================== 4300 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even 4301 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity 4302 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity 4303 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0 4304 ====================================== ===== ============================== 4305 4306.. 4307 4308 .. table:: Floating Point Denorm Mode Enumeration Values 4309 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table 4310 4311 ====================================== ===== ============================== 4312 Enumeration Name Value Description 4313 ====================================== ===== ============================== 4314 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination 4315 Denorms 4316 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms 4317 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms 4318 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush 4319 ====================================== ===== ============================== 4320 4321.. 4322 4323 .. table:: System VGPR Work-Item ID Enumeration Values 4324 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table 4325 4326 ======================================== ===== ============================ 4327 Enumeration Name Value Description 4328 ======================================== ===== ============================ 4329 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension 4330 ID. 
4331 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y 4332 dimensions ID. 4333 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z 4334 dimensions ID. 4335 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined. 4336 ======================================== ===== ============================ 4337 4338.. _amdgpu-amdhsa-initial-kernel-execution-state: 4339 4340Initial Kernel Execution State 4341~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4342 4343This section defines the register state that will be set up by the packet 4344processor prior to the start of execution of every wavefront. This is limited by 4345the constraints of the hardware controllers of CP/ADC/SPI. 4346 4347The order of the SGPR registers is defined, but the compiler can specify which 4348ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit 4349fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 4350for enabled registers are dense starting at SGPR0: the first enabled register is 4351SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have 4352an SGPR number. 4353 4354The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to 4355all wavefronts of the grid. It is possible to specify more than 16 User SGPRs 4356using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are 4357actually initialized. These are then immediately followed by the System SGPRs 4358that are set up by ADC/SPI and can have different values for each wavefront of 4359the grid dispatch. 4360 4361SGPR register initial state is defined in 4362:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 4363 4364 .. 
table:: SGPR Register Set Up Order 4365 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table 4366 4367 ========== ========================== ====== ============================== 4368 SGPR Order Name Number Description 4369 (kernel descriptor enable of 4370 field) SGPRs 4371 ========== ========================== ====== ============================== 4372 First Private Segment Buffer 4 See 4373 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`. 4374 _segment_buffer) 4375 then Dispatch Ptr 2 64-bit address of AQL dispatch 4376 (enable_sgpr_dispatch_ptr) packet for kernel dispatch 4377 actually executing. 4378 then Queue Ptr 2 64-bit address of amd_queue_t 4379 (enable_sgpr_queue_ptr) object for AQL queue on which 4380 the dispatch packet was 4381 queued. 4382 then Kernarg Segment Ptr 2 64-bit address of Kernarg 4383 (enable_sgpr_kernarg segment. This is directly 4384 _segment_ptr) copied from the 4385 kernarg_address in the kernel 4386 dispatch packet. 4387 4388 Having CP load it once avoids 4389 loading it at the beginning of 4390 every wavefront. 4391 then Dispatch Id 2 64-bit Dispatch ID of the 4392 (enable_sgpr_dispatch_id) dispatch packet being 4393 executed. 4394 then Flat Scratch Init 2 See 4395 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4396 _init) 4397 then Private Segment Size 1 The 32-bit byte size of a 4398 (enable_sgpr_private single work-item's memory 4399 _segment_size) allocation. This is the 4400 value from the kernel 4401 dispatch packet Private 4402 Segment Byte Size rounded up 4403 by CP to a multiple of 4404 DWORD. 4405 4406 Having CP load it once avoids 4407 loading it at the beginning of 4408 every wavefront. 4409 4410 This is not used for 4411 GFX7-GFX8 since it is the same 4412 value as the second SGPR of 4413 Flat Scratch Init. However, it 4414 may be needed for GFX9-GFX10 which 4415 changes the meaning of the 4416 Flat Scratch Init value. 
4417 then Work-Group Id X 1 32-bit work-group id in X 4418 (enable_sgpr_workgroup_id dimension of grid for 4419 _X) wavefront. 4420 then Work-Group Id Y 1 32-bit work-group id in Y 4421 (enable_sgpr_workgroup_id dimension of grid for 4422 _Y) wavefront. 4423 then Work-Group Id Z 1 32-bit work-group id in Z 4424 (enable_sgpr_workgroup_id dimension of grid for 4425 _Z) wavefront. 4426 then Work-Group Info 1 {first_wavefront, 14'b0000, 4427 (enable_sgpr_workgroup ordered_append_term[10:0], 4428 _info) threadgroup_size_in_wavefronts[5:0]} 4429 then Scratch Wavefront Offset 1 See 4430 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4431 _segment_wavefront_offset) and 4432 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`. 4433 ========== ========================== ====== ============================== 4434 4435The order of the VGPR registers is defined, but the compiler can specify which 4436ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit 4437fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 4438for enabled registers are dense starting at VGPR0: the first enabled register is 4439VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 4440VGPR number. 4441 4442There are different methods used for the VGPR initial state: 4443 4444* Unless the *Target Properties* column of :ref:`amdgpu-processor-table` 4445 specifies otherwise, a separate VGPR register is used per work-item ID. The 4446 VGPR register initial state for this method is defined in 4447 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`. 4448* If *Target Properties* column of :ref:`amdgpu-processor-table` 4449 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used 4450 for all work-item IDs. The register layout for this method is defined in 4451 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`. 4452 4453 .. 
table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method 4454 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table 4455 4456 ========== ========================== ====== ============================== 4457 VGPR Order Name Number Description 4458 (kernel descriptor enable of 4459 field) VGPRs 4460 ========== ========================== ====== ============================== 4461 First Work-Item Id X 1 32-bit work-item id in X 4462 (Always initialized) dimension of work-group for 4463 wavefront lane. 4464 then Work-Item Id Y 1 32-bit work-item id in Y 4465 (enable_vgpr_workitem_id dimension of work-group for 4466 > 0) wavefront lane. 4467 then Work-Item Id Z 1 32-bit work-item id in Z 4468 (enable_vgpr_workitem_id dimension of work-group for 4469 > 1) wavefront lane. 4470 ========== ========================== ====== ============================== 4471 4472.. 4473 4474 .. table:: Register Layout for Packed Work-Item ID Method 4475 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table 4476 4477 ======= ======= ================ ========================================= 4478 Bits Size Field Name Description 4479 ======= ======= ================ ========================================= 4480 0:9 10 bits Work-Item Id X Work-item id in X 4481 dimension of work-group for 4482 wavefront lane. 4483 4484 Always initialized. 4485 4486 10:19 10 bits Work-Item Id Y Work-item id in Y 4487 dimension of work-group for 4488 wavefront lane. 4489 4490 Initialized if enable_vgpr_workitem_id > 4491 0, otherwise set to 0. 4492 20:29 10 bits Work-Item Id Z Work-item id in Z 4493 dimension of work-group for 4494 wavefront lane. 4495 4496 Initialized if enable_vgpr_workitem_id > 4497 1, otherwise set to 0. 4498 30:31 2 bits Reserved, set to 0. 4499 ======= ======= ================ ========================================= 4500 4501The setting of registers is done by GPU CP/ADC/SPI hardware as follows: 4502 45031. 
SGPRs before the Work-Group Ids are set by CP using the 16 User Data 4504 registers. 45052. Work-group Id registers X, Y, Z are set by ADC which supports any 4506 combination including none. 45073. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why 4508 its value cannot be included with the flat scratch init value which is per 4509 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). 45104. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) 4511 or (X, Y, Z). 45125. Flat Scratch register pair initialization is described in 4513 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4514 4515The global segment can be accessed either using buffer instructions (GFX6 which 4516has V# 64-bit address support), flat instructions (GFX7-GFX10), or global 4517instructions (GFX9-GFX10). 4518 4519If buffer operations are used, then the compiler can generate a V# with the 4520following properties: 4521 4522* base address of 0 4523* no swizzle 4524* ATC: 1 if IOMMU present (such as APU) 4525* ptr64: 1 4526* MTYPE set to support memory coherence that matches the runtime (such as CC for 4527 APU and NC for dGPU). 4528 4529.. _amdgpu-amdhsa-kernel-prolog: 4530 4531Kernel Prolog 4532~~~~~~~~~~~~~ 4533 4534The compiler performs initialization in the kernel prologue depending on the 4535target and information about things like stack usage in the kernel and called 4536functions. Some of this initialization requires the compiler to request certain 4537User and System SGPRs be present in the 4538:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the 4539:ref:`amdgpu-amdhsa-kernel-descriptor`. 4540 4541.. _amdgpu-amdhsa-kernel-prolog-cfi: 4542 4543CFI 4544+++ 4545 45461. The CFI return address is undefined. 4547 45482. The CFI CFA is defined using an expression which evaluates to a location 4549 description that comprises one memory location description for the 4550 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``. 
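As a cross-check of the packed work-item ID layout in
:ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`
above, the following minimal sketch (Python, illustrative only; the function
name is hypothetical and not part of any AMDGPU API) decodes the three 10-bit
IDs from a packed VGPR0 value:

```python
# Sketch of the packed work-item ID layout described above:
# bits 0:9 hold X, bits 10:19 hold Y, bits 20:29 hold Z (10 bits each),
# bits 30:31 are reserved.

def decode_packed_workitem_id(vgpr0: int) -> tuple[int, int, int]:
    """Split a packed VGPR0 value into (X, Y, Z) work-item IDs."""
    mask = (1 << 10) - 1          # each ID field is 10 bits wide
    x = vgpr0 & mask              # bits 0:9
    y = (vgpr0 >> 10) & mask      # bits 10:19
    z = (vgpr0 >> 20) & mask      # bits 20:29
    return x, y, z
```

For example, a packed value of ``(5 << 20) | (3 << 10) | 7`` decodes to
``(7, 3, 5)``.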
4551 4552.. _amdgpu-amdhsa-kernel-prolog-m0: 4553 4554M0 4555++ 4556 4557GFX6-GFX8 4558 The M0 register must be initialized with a value at least the total LDS size 4559 if the kernel may access LDS via DS or flat operations. Total LDS size is 4560 available in dispatch packet. For M0, it is also possible to use maximum 4561 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for 4562 GFX7-GFX8). 4563GFX9-GFX10 4564 The M0 register is not used for range checking LDS accesses and so does not 4565 need to be initialized in the prolog. 4566 4567.. _amdgpu-amdhsa-kernel-prolog-stack-pointer: 4568 4569Stack Pointer 4570+++++++++++++ 4571 4572If the kernel has function calls it must set up the ABI stack pointer described 4573in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting 4574SGPR32 to the unswizzled scratch offset of the address past the last local 4575allocation. 4576 4577.. _amdgpu-amdhsa-kernel-prolog-frame-pointer: 4578 4579Frame Pointer 4580+++++++++++++ 4581 4582If the kernel needs a frame pointer for the reasons defined in 4583``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the 4584kernel prolog. If a frame pointer is not required then all uses of the frame 4585pointer are replaced with immediate ``0`` offsets. 4586 4587.. _amdgpu-amdhsa-kernel-prolog-flat-scratch: 4588 4589Flat Scratch 4590++++++++++++ 4591 4592There are different methods used for initializing flat scratch: 4593 4594* If the *Target Properties* column of :ref:`amdgpu-processor-table` 4595 specifies *Does not support generic address space*: 4596 4597 Flat scratch is not supported and there is no flat scratch register pair. 
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Offset flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses the Flat Scratch Init
  and Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address.

     CP obtains this from the runtime. (The Scratch Segment Buffer base address
     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)

     The prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.

     The Scratch Wavefront Offset must also be used as an offset with Private
     segment address when using the Scratch Segment Buffer.

     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.

     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
     SGPRn is the highest numbered SGPR allocated to the wavefront).
     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
     FLAT SCRATCH BASE in flat memory instructions that access the scratch
     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.

     CP obtains this from the runtime, and it is always a multiple of DWORD.
     CP checks that the value in the kernel dispatch packet Private Segment
     Byte Size is not larger and requests the runtime to increase the queue's
     scratch size if necessary.

     CP directly loads from the kernel dispatch packet Private Segment Byte
     Size field and rounds up to a multiple of DWORD. Having CP load it once
     avoids loading it at the beginning of every wavefront.

     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
     in flat memory instructions.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Absolute flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3).
  Initialization uses the Flat Scratch Init and Scratch Wavefront Offset SGPR
  registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.

  CP obtains this from the runtime.

  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
  memory instructions.

  The Scratch Wavefront Offset must also be used as an offset with Private
  segment address when using the Scratch Segment Buffer (see
  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Architected flat scratch*:

  If ENABLE_PRIVATE_SEGMENT is enabled in
  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
  register pair will be initialized to the 64-bit address of the base of
  scratch backing memory being managed by SPI for the queue executing the
  kernel dispatch plus the value of the wave's Scratch Wavefront Offset for use
  as the flat scratch base in flat memory instructions.

.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:

Private Segment Buffer
++++++++++++++++++++++

If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
*Architected flat scratch* then a Private Segment Buffer is not supported.
Instead the flat SCRATCH instructions are used.

Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by
the runtime. It is used, together with Scratch Wavefront Offset as an offset,
to access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:

  - If it is known during instruction selection that there is stack usage,
    SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
    optimizations are disabled (``-O0``), if stack objects already exist (for
    locals, etc.), or if there are any function calls.

  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
    are reserved for the tentative scratch V#. These will be used if it is
    determined that spilling is needed.

    - If no use is made of the tentative scratch V#, then it is unreserved,
      and the register count is determined ignoring it.
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.

  .. note::

    This approach of using a tentative scratch V# and shifting the register
    numbers if used avoids having to perform register allocation a second
    time if the tentative V# is eliminated. This is more efficient and
    avoids the problem that the second register allocation may perform
    spilling which will fail as there is no longer a scratch V#.

When the kernel prolog code is being emitted it is known whether the scratch V#
described above is actually used. If it is, the prolog code must set it up by
copying the Private Segment Buffer to the scratch V# registers and then adding
the Private Segment Wavefront Offset to the queue base address in the V#. The
result is a V# with a base address pointing to the beginning of the wavefront
scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute.
The ``s_waitcnt`` and cache management instructions such as
``buffer_wbinvl1_vol`` are defined with respect to other memory instructions
executed by the same thread. This allows them to be moved earlier or later,
which can allow them to be combined with other instances of the same
instruction, or hoisted/sunk out of loops to improve performance. Only the
instructions related to the memory model are given; additional ``s_waitcnt``
instructions are required to ensure registers are defined before being used.
These may be able to be combined with the memory model ``s_waitcnt``
instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions, join the relationships.
    Since the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both local and
    global address space were specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.
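The two instruction classes defined above depend on both the mnemonic family
and the memory actually accessed (flat instructions can touch either local or
global memory). A minimal sketch of the classification (Python, illustrative
only; the function and parameter names are hypothetical, not LLVM APIs):

```python
# Sketch of the LDS-operation vs. vector-memory-operation classification
# described above. The accessed memory must be supplied because flat
# instructions can access either local or global memory.

def classify_memory_op(mnemonic: str, accessed_memory: str) -> str:
    if accessed_memory == "local" and mnemonic.startswith(("ds_", "flat_")):
        return "LDS operation"
    if accessed_memory == "global" and mnemonic.startswith(
            ("buffer_", "global_", "flat_")):
        return "vector memory operation"
    return "other"
```

For example, ``ds_load`` to local memory is an LDS operation, while a
``flat_load`` is an LDS operation or a vector memory operation depending on
which memory its address resolves to.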
Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores; atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space, which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

The memory order also adds the single thread optimization constraints defined
in table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

The code sequences used to implement the memory model are defined in the
following sections:

* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx10`

.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:

Memory Model GFX6-GFX9
++++++++++++++++++++++

For GFX6-GFX9:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so they do
  not impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or
  ranges of virtual addresses can be set up to bypass it to ensure system
  coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache
to ensure it is coherent with the vector caches. The scalar and vector caches
are invalidated between kernel dispatches by CP since constant address space
data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are
used, then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate
the volatile cache lines.

The code sequences used to implement the memory model for GFX6-GFX9 are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX6-GFX9
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                               - workgroup    - generic
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load
     load atomic  acquire      - workgroup    - local    1. ds/flat_load
                                              - generic  2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local load
                                                              atomic value being
                                                              acquired.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the load
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the flat_load
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
                                              - generic  2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local
                                                              atomicrmw value
                                                              being acquired.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
5161 - Must happen before 5162 following 5163 buffer_wbinvl1_vol. 5164 - Ensures the 5165 atomicrmw has 5166 completed before 5167 invalidating the 5168 cache. 5169 5170 3. buffer_wbinvl1_vol 5171 5172 - Must happen before 5173 any following 5174 global/generic 5175 load/load 5176 atomic/atomicrmw. 5177 - Ensures that 5178 following loads 5179 will not see stale 5180 global data. 5181 5182 fence acquire - singlethread *none* *none* 5183 - wavefront 5184 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5185 5186 - If OpenCL and 5187 address space is 5188 not generic, omit. 5189 - However, since LLVM 5190 currently has no 5191 address space on 5192 the fence need to 5193 conservatively 5194 always generate. If 5195 fence had an 5196 address space then 5197 set to address 5198 space of OpenCL 5199 fence flag, or to 5200 generic if both 5201 local and global 5202 flags are 5203 specified. 5204 - Must happen after 5205 any preceding 5206 local/generic load 5207 atomic/atomicrmw 5208 with an equal or 5209 wider sync scope 5210 and memory ordering 5211 stronger than 5212 unordered (this is 5213 termed the 5214 fence-paired-atomic). 5215 - Must happen before 5216 any following 5217 global/generic 5218 load/load 5219 atomic/store/store 5220 atomic/atomicrmw. 5221 - Ensures any 5222 following global 5223 data read is no 5224 older than the 5225 value read by the 5226 fence-paired-atomic. 5227 5228 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 5229 - system vmcnt(0) 5230 5231 - If OpenCL and 5232 address space is 5233 not generic, omit 5234 lgkmcnt(0). 5235 - However, since LLVM 5236 currently has no 5237 address space on 5238 the fence need to 5239 conservatively 5240 always generate 5241 (see comment for 5242 previous fence). 5243 - Could be split into 5244 separate s_waitcnt 5245 vmcnt(0) and 5246 s_waitcnt 5247 lgkmcnt(0) to allow 5248 them to be 5249 independently moved 5250 according to the 5251 following rules. 
5252 - s_waitcnt vmcnt(0) 5253 must happen after 5254 any preceding 5255 global/generic load 5256 atomic/atomicrmw 5257 with an equal or 5258 wider sync scope 5259 and memory ordering 5260 stronger than 5261 unordered (this is 5262 termed the 5263 fence-paired-atomic). 5264 - s_waitcnt lgkmcnt(0) 5265 must happen after 5266 any preceding 5267 local/generic load 5268 atomic/atomicrmw 5269 with an equal or 5270 wider sync scope 5271 and memory ordering 5272 stronger than 5273 unordered (this is 5274 termed the 5275 fence-paired-atomic). 5276 - Must happen before 5277 the following 5278 buffer_wbinvl1_vol. 5279 - Ensures that the 5280 fence-paired atomic 5281 has completed 5282 before invalidating 5283 the 5284 cache. Therefore 5285 any following 5286 locations read must 5287 be no older than 5288 the value read by 5289 the 5290 fence-paired-atomic. 5291 5292 2. buffer_wbinvl1_vol 5293 5294 - Must happen before any 5295 following global/generic 5296 load/load 5297 atomic/store/store 5298 atomic/atomicrmw. 5299 - Ensures that 5300 following loads 5301 will not see stale 5302 global data. 5303 5304 **Release Atomic** 5305 ------------------------------------------------------------------------------------ 5306 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 5307 - wavefront - local 5308 - generic 5309 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5310 - generic 5311 - If OpenCL, omit. 5312 - Must happen after 5313 any preceding 5314 local/generic 5315 load/store/load 5316 atomic/store 5317 atomic/atomicrmw. 5318 - Must happen before 5319 the following 5320 store. 5321 - Ensures that all 5322 memory operations 5323 to local have 5324 completed before 5325 performing the 5326 store that is being 5327 released. 5328 5329 2. buffer/global/flat_store 5330 store atomic release - workgroup - local 1. ds_store 5331 store atomic release - agent - global 1. 
s_waitcnt lgkmcnt(0) & 5332 - system - generic vmcnt(0) 5333 5334 - If OpenCL and 5335 address space is 5336 not generic, omit 5337 lgkmcnt(0). 5338 - Could be split into 5339 separate s_waitcnt 5340 vmcnt(0) and 5341 s_waitcnt 5342 lgkmcnt(0) to allow 5343 them to be 5344 independently moved 5345 according to the 5346 following rules. 5347 - s_waitcnt vmcnt(0) 5348 must happen after 5349 any preceding 5350 global/generic 5351 load/store/load 5352 atomic/store 5353 atomic/atomicrmw. 5354 - s_waitcnt lgkmcnt(0) 5355 must happen after 5356 any preceding 5357 local/generic 5358 load/store/load 5359 atomic/store 5360 atomic/atomicrmw. 5361 - Must happen before 5362 the following 5363 store. 5364 - Ensures that all 5365 memory operations 5366 to memory have 5367 completed before 5368 performing the 5369 store that is being 5370 released. 5371 5372 2. buffer/global/flat_store 5373 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 5374 - wavefront - local 5375 - generic 5376 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5377 - generic 5378 - If OpenCL, omit. 5379 - Must happen after 5380 any preceding 5381 local/generic 5382 load/store/load 5383 atomic/store 5384 atomic/atomicrmw. 5385 - Must happen before 5386 the following 5387 atomicrmw. 5388 - Ensures that all 5389 memory operations 5390 to local have 5391 completed before 5392 performing the 5393 atomicrmw that is 5394 being released. 5395 5396 2. buffer/global/flat_atomic 5397 atomicrmw release - workgroup - local 1. ds_atomic 5398 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 5399 - system - generic vmcnt(0) 5400 5401 - If OpenCL, omit 5402 lgkmcnt(0). 5403 - Could be split into 5404 separate s_waitcnt 5405 vmcnt(0) and 5406 s_waitcnt 5407 lgkmcnt(0) to allow 5408 them to be 5409 independently moved 5410 according to the 5411 following rules. 
5412 - s_waitcnt vmcnt(0) 5413 must happen after 5414 any preceding 5415 global/generic 5416 load/store/load 5417 atomic/store 5418 atomic/atomicrmw. 5419 - s_waitcnt lgkmcnt(0) 5420 must happen after 5421 any preceding 5422 local/generic 5423 load/store/load 5424 atomic/store 5425 atomic/atomicrmw. 5426 - Must happen before 5427 the following 5428 atomicrmw. 5429 - Ensures that all 5430 memory operations 5431 to global and local 5432 have completed 5433 before performing 5434 the atomicrmw that 5435 is being released. 5436 5437 2. buffer/global/flat_atomic 5438 fence release - singlethread *none* *none* 5439 - wavefront 5440 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5441 5442 - If OpenCL and 5443 address space is 5444 not generic, omit. 5445 - However, since LLVM 5446 currently has no 5447 address space on 5448 the fence need to 5449 conservatively 5450 always generate. If 5451 fence had an 5452 address space then 5453 set to address 5454 space of OpenCL 5455 fence flag, or to 5456 generic if both 5457 local and global 5458 flags are 5459 specified. 5460 - Must happen after 5461 any preceding 5462 local/generic 5463 load/load 5464 atomic/store/store 5465 atomic/atomicrmw. 5466 - Must happen before 5467 any following store 5468 atomic/atomicrmw 5469 with an equal or 5470 wider sync scope 5471 and memory ordering 5472 stronger than 5473 unordered (this is 5474 termed the 5475 fence-paired-atomic). 5476 - Ensures that all 5477 memory operations 5478 to local have 5479 completed before 5480 performing the 5481 following 5482 fence-paired-atomic. 5483 5484 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 5485 - system vmcnt(0) 5486 5487 - If OpenCL and 5488 address space is 5489 not generic, omit 5490 lgkmcnt(0). 5491 - If OpenCL and 5492 address space is 5493 local, omit 5494 vmcnt(0). 5495 - However, since LLVM 5496 currently has no 5497 address space on 5498 the fence need to 5499 conservatively 5500 always generate. 
If 5501 fence had an 5502 address space then 5503 set to address 5504 space of OpenCL 5505 fence flag, or to 5506 generic if both 5507 local and global 5508 flags are 5509 specified. 5510 - Could be split into 5511 separate s_waitcnt 5512 vmcnt(0) and 5513 s_waitcnt 5514 lgkmcnt(0) to allow 5515 them to be 5516 independently moved 5517 according to the 5518 following rules. 5519 - s_waitcnt vmcnt(0) 5520 must happen after 5521 any preceding 5522 global/generic 5523 load/store/load 5524 atomic/store 5525 atomic/atomicrmw. 5526 - s_waitcnt lgkmcnt(0) 5527 must happen after 5528 any preceding 5529 local/generic 5530 load/store/load 5531 atomic/store 5532 atomic/atomicrmw. 5533 - Must happen before 5534 any following store 5535 atomic/atomicrmw 5536 with an equal or 5537 wider sync scope 5538 and memory ordering 5539 stronger than 5540 unordered (this is 5541 termed the 5542 fence-paired-atomic). 5543 - Ensures that all 5544 memory operations 5545 have 5546 completed before 5547 performing the 5548 following 5549 fence-paired-atomic. 5550 5551 **Acquire-Release Atomic** 5552 ------------------------------------------------------------------------------------ 5553 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 5554 - wavefront - local 5555 - generic 5556 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 5557 5558 - If OpenCL, omit. 5559 - Must happen after 5560 any preceding 5561 local/generic 5562 load/store/load 5563 atomic/store 5564 atomic/atomicrmw. 5565 - Must happen before 5566 the following 5567 atomicrmw. 5568 - Ensures that all 5569 memory operations 5570 to local have 5571 completed before 5572 performing the 5573 atomicrmw that is 5574 being released. 5575 5576 2. buffer/global_atomic 5577 5578 atomicrmw acq_rel - workgroup - local 1. ds_atomic 5579 2. s_waitcnt lgkmcnt(0) 5580 5581 - If OpenCL, omit. 
5582 - Must happen before 5583 any following 5584 global/generic 5585 load/load 5586 atomic/store/store 5587 atomic/atomicrmw. 5588 - Ensures any 5589 following global 5590 data read is no 5591 older than the local load 5592 atomic value being 5593 acquired. 5594 5595 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 5596 5597 - If OpenCL, omit. 5598 - Must happen after 5599 any preceding 5600 local/generic 5601 load/store/load 5602 atomic/store 5603 atomic/atomicrmw. 5604 - Must happen before 5605 the following 5606 atomicrmw. 5607 - Ensures that all 5608 memory operations 5609 to local have 5610 completed before 5611 performing the 5612 atomicrmw that is 5613 being released. 5614 5615 2. flat_atomic 5616 3. s_waitcnt lgkmcnt(0) 5617 5618 - If OpenCL, omit. 5619 - Must happen before 5620 any following 5621 global/generic 5622 load/load 5623 atomic/store/store 5624 atomic/atomicrmw. 5625 - Ensures any 5626 following global 5627 data read is no 5628 older than a local load 5629 atomic value being 5630 acquired. 5631 5632 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 5633 - system vmcnt(0) 5634 5635 - If OpenCL, omit 5636 lgkmcnt(0). 5637 - Could be split into 5638 separate s_waitcnt 5639 vmcnt(0) and 5640 s_waitcnt 5641 lgkmcnt(0) to allow 5642 them to be 5643 independently moved 5644 according to the 5645 following rules. 5646 - s_waitcnt vmcnt(0) 5647 must happen after 5648 any preceding 5649 global/generic 5650 load/store/load 5651 atomic/store 5652 atomic/atomicrmw. 5653 - s_waitcnt lgkmcnt(0) 5654 must happen after 5655 any preceding 5656 local/generic 5657 load/store/load 5658 atomic/store 5659 atomic/atomicrmw. 5660 - Must happen before 5661 the following 5662 atomicrmw. 5663 - Ensures that all 5664 memory operations 5665 to global have 5666 completed before 5667 performing the 5668 atomicrmw that is 5669 being released. 5670 5671 2. buffer/global_atomic 5672 3. 
s_waitcnt vmcnt(0) 5673 5674 - Must happen before 5675 following 5676 buffer_wbinvl1_vol. 5677 - Ensures the 5678 atomicrmw has 5679 completed before 5680 invalidating the 5681 cache. 5682 5683 4. buffer_wbinvl1_vol 5684 5685 - Must happen before 5686 any following 5687 global/generic 5688 load/load 5689 atomic/atomicrmw. 5690 - Ensures that 5691 following loads 5692 will not see stale 5693 global data. 5694 5695 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 5696 - system vmcnt(0) 5697 5698 - If OpenCL, omit 5699 lgkmcnt(0). 5700 - Could be split into 5701 separate s_waitcnt 5702 vmcnt(0) and 5703 s_waitcnt 5704 lgkmcnt(0) to allow 5705 them to be 5706 independently moved 5707 according to the 5708 following rules. 5709 - s_waitcnt vmcnt(0) 5710 must happen after 5711 any preceding 5712 global/generic 5713 load/store/load 5714 atomic/store 5715 atomic/atomicrmw. 5716 - s_waitcnt lgkmcnt(0) 5717 must happen after 5718 any preceding 5719 local/generic 5720 load/store/load 5721 atomic/store 5722 atomic/atomicrmw. 5723 - Must happen before 5724 the following 5725 atomicrmw. 5726 - Ensures that all 5727 memory operations 5728 to global have 5729 completed before 5730 performing the 5731 atomicrmw that is 5732 being released. 5733 5734 2. flat_atomic 5735 3. s_waitcnt vmcnt(0) & 5736 lgkmcnt(0) 5737 5738 - If OpenCL, omit 5739 lgkmcnt(0). 5740 - Must happen before 5741 following 5742 buffer_wbinvl1_vol. 5743 - Ensures the 5744 atomicrmw has 5745 completed before 5746 invalidating the 5747 cache. 5748 5749 4. buffer_wbinvl1_vol 5750 5751 - Must happen before 5752 any following 5753 global/generic 5754 load/load 5755 atomic/atomicrmw. 5756 - Ensures that 5757 following loads 5758 will not see stale 5759 global data. 5760 5761 fence acq_rel - singlethread *none* *none* 5762 - wavefront 5763 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5764 5765 - If OpenCL and 5766 address space is 5767 not generic, omit. 
5768 - However, 5769 since LLVM 5770 currently has no 5771 address space on 5772 the fence need to 5773 conservatively 5774 always generate 5775 (see comment for 5776 previous fence). 5777 - Must happen after 5778 any preceding 5779 local/generic 5780 load/load 5781 atomic/store/store 5782 atomic/atomicrmw. 5783 - Must happen before 5784 any following 5785 global/generic 5786 load/load 5787 atomic/store/store 5788 atomic/atomicrmw. 5789 - Ensures that all 5790 memory operations 5791 to local have 5792 completed before 5793 performing any 5794 following global 5795 memory operations. 5796 - Ensures that the 5797 preceding 5798 local/generic load 5799 atomic/atomicrmw 5800 with an equal or 5801 wider sync scope 5802 and memory ordering 5803 stronger than 5804 unordered (this is 5805 termed the 5806 acquire-fence-paired-atomic) 5807 has completed 5808 before following 5809 global memory 5810 operations. This 5811 satisfies the 5812 requirements of 5813 acquire. 5814 - Ensures that all 5815 previous memory 5816 operations have 5817 completed before a 5818 following 5819 local/generic store 5820 atomic/atomicrmw 5821 with an equal or 5822 wider sync scope 5823 and memory ordering 5824 stronger than 5825 unordered (this is 5826 termed the 5827 release-fence-paired-atomic). 5828 This satisfies the 5829 requirements of 5830 release. 5831 5832 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 5833 - system vmcnt(0) 5834 5835 - If OpenCL and 5836 address space is 5837 not generic, omit 5838 lgkmcnt(0). 5839 - However, since LLVM 5840 currently has no 5841 address space on 5842 the fence need to 5843 conservatively 5844 always generate 5845 (see comment for 5846 previous fence). 5847 - Could be split into 5848 separate s_waitcnt 5849 vmcnt(0) and 5850 s_waitcnt 5851 lgkmcnt(0) to allow 5852 them to be 5853 independently moved 5854 according to the 5855 following rules. 
5856 - s_waitcnt vmcnt(0) 5857 must happen after 5858 any preceding 5859 global/generic 5860 load/store/load 5861 atomic/store 5862 atomic/atomicrmw. 5863 - s_waitcnt lgkmcnt(0) 5864 must happen after 5865 any preceding 5866 local/generic 5867 load/store/load 5868 atomic/store 5869 atomic/atomicrmw. 5870 - Must happen before 5871 the following 5872 buffer_wbinvl1_vol. 5873 - Ensures that the 5874 preceding 5875 global/local/generic 5876 load 5877 atomic/atomicrmw 5878 with an equal or 5879 wider sync scope 5880 and memory ordering 5881 stronger than 5882 unordered (this is 5883 termed the 5884 acquire-fence-paired-atomic) 5885 has completed 5886 before invalidating 5887 the cache. This 5888 satisfies the 5889 requirements of 5890 acquire. 5891 - Ensures that all 5892 previous memory 5893 operations have 5894 completed before a 5895 following 5896 global/local/generic 5897 store 5898 atomic/atomicrmw 5899 with an equal or 5900 wider sync scope 5901 and memory ordering 5902 stronger than 5903 unordered (this is 5904 termed the 5905 release-fence-paired-atomic). 5906 This satisfies the 5907 requirements of 5908 release. 5909 5910 2. buffer_wbinvl1_vol 5911 5912 - Must happen before 5913 any following 5914 global/generic 5915 load/load 5916 atomic/store/store 5917 atomic/atomicrmw. 5918 - Ensures that 5919 following loads 5920 will not see stale 5921 global data. This 5922 satisfies the 5923 requirements of 5924 acquire. 5925 5926 **Sequential Consistent Atomic** 5927 ------------------------------------------------------------------------------------ 5928 load atomic seq_cst - singlethread - global *Same as corresponding 5929 - wavefront - local load atomic acquire, 5930 - generic except must generate 5931 all instructions even 5932 for OpenCL.* 5933 load atomic seq_cst - workgroup - global 1. 
s_waitcnt lgkmcnt(0) 5934 - generic 5935 5936 - Must 5937 happen after 5938 preceding 5939 local/generic load 5940 atomic/store 5941 atomic/atomicrmw 5942 with memory 5943 ordering of seq_cst 5944 and with equal or 5945 wider sync scope. 5946 (Note that seq_cst 5947 fences have their 5948 own s_waitcnt 5949 lgkmcnt(0) and so do 5950 not need to be 5951 considered.) 5952 - Ensures any 5953 preceding 5954 sequential 5955 consistent local 5956 memory instructions 5957 have completed 5958 before executing 5959 this sequentially 5960 consistent 5961 instruction. This 5962 prevents reordering 5963 a seq_cst store 5964 followed by a 5965 seq_cst load. (Note 5966 that seq_cst is 5967 stronger than 5968 acquire/release as 5969 the reordering of 5970 load acquire 5971 followed by a store 5972 release is 5973 prevented by the 5974 s_waitcnt of 5975 the release, but 5976 there is nothing 5977 preventing a store 5978 release followed by 5979 load acquire from 5980 completing out of 5981 order. The s_waitcnt 5982 could be placed after 5983 seq_store or before 5984 the seq_load. We 5985 choose the load to 5986 make the s_waitcnt be 5987 as late as possible 5988 so that the store 5989 may have already 5990 completed.) 5991 5992 2. *Following 5993 instructions same as 5994 corresponding load 5995 atomic acquire, 5996 except must generate 5997 all instructions even 5998 for OpenCL.* 5999 load atomic seq_cst - workgroup - local *Same as corresponding 6000 load atomic acquire, 6001 except must generate 6002 all instructions even 6003 for OpenCL.* 6004 6005 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 6006 - system - generic vmcnt(0) 6007 6008 - Could be split into 6009 separate s_waitcnt 6010 vmcnt(0) 6011 and s_waitcnt 6012 lgkmcnt(0) to allow 6013 them to be 6014 independently moved 6015 according to the 6016 following rules. 
6017 - s_waitcnt lgkmcnt(0) 6018 must happen after 6019 preceding 6020 global/generic load 6021 atomic/store 6022 atomic/atomicrmw 6023 with memory 6024 ordering of seq_cst 6025 and with equal or 6026 wider sync scope. 6027 (Note that seq_cst 6028 fences have their 6029 own s_waitcnt 6030 lgkmcnt(0) and so do 6031 not need to be 6032 considered.) 6033 - s_waitcnt vmcnt(0) 6034 must happen after 6035 preceding 6036 global/generic load 6037 atomic/store 6038 atomic/atomicrmw 6039 with memory 6040 ordering of seq_cst 6041 and with equal or 6042 wider sync scope. 6043 (Note that seq_cst 6044 fences have their 6045 own s_waitcnt 6046 vmcnt(0) and so do 6047 not need to be 6048 considered.) 6049 - Ensures any 6050 preceding 6051 sequential 6052 consistent global 6053 memory instructions 6054 have completed 6055 before executing 6056 this sequentially 6057 consistent 6058 instruction. This 6059 prevents reordering 6060 a seq_cst store 6061 followed by a 6062 seq_cst load. (Note 6063 that seq_cst is 6064 stronger than 6065 acquire/release as 6066 the reordering of 6067 load acquire 6068 followed by a store 6069 release is 6070 prevented by the 6071 s_waitcnt of 6072 the release, but 6073 there is nothing 6074 preventing a store 6075 release followed by 6076 load acquire from 6077 completing out of 6078 order. The s_waitcnt 6079 could be placed after 6080 seq_store or before 6081 the seq_load. We 6082 choose the load to 6083 make the s_waitcnt be 6084 as late as possible 6085 so that the store 6086 may have already 6087 completed.) 6088 6089 2. 
*Following 6090 instructions same as 6091 corresponding load 6092 atomic acquire, 6093 except must generate 6094 all instructions even 6095 for OpenCL.* 6096 store atomic seq_cst - singlethread - global *Same as corresponding 6097 - wavefront - local store atomic release, 6098 - workgroup - generic except must generate 6099 - agent all instructions even 6100 - system for OpenCL.* 6101 atomicrmw seq_cst - singlethread - global *Same as corresponding 6102 - wavefront - local atomicrmw acq_rel, 6103 - workgroup - generic except must generate 6104 - agent all instructions even 6105 - system for OpenCL.* 6106 fence seq_cst - singlethread *none* *Same as corresponding 6107 - wavefront fence acq_rel, 6108 - workgroup except must generate 6109 - agent all instructions even 6110 - system for OpenCL.* 6111 ============ ============ ============== ========== ================================ 6112 6113.. _amdgpu-amdhsa-memory-model-gfx90a: 6114 6115Memory Model GFX90A 6116+++++++++++++++++++ 6117 6118For GFX90A: 6119 6120* Each agent has multiple shader arrays (SA). 6121* Each SA has multiple compute units (CU). 6122* Each CU has multiple SIMDs that execute wavefronts. 6123* The wavefronts for a single work-group are executed in the same CU but may be 6124 executed by different SIMDs. The exception is when in tgsplit execution mode 6125 when the wavefronts may be executed by different SIMDs in different CUs. 6126* Each CU has a single LDS memory shared by the wavefronts of the work-groups 6127 executing on it. The exception is when in tgsplit execution mode when no LDS 6128 is allocated as wavefronts of the same work-group can be in different CUs. 6129* All LDS operations of a CU are performed as wavefront wide operations in a 6130 global order and involve no caching. Completion is reported to a wavefront in 6131 execution order. 6132* The LDS memory has multiple request queues shared by the SIMDs of a 6133 CU. 
Therefore, the LDS operations performed by different wavefronts of a 6134 work-group can be reordered relative to each other, which can result in 6135 reordering the visibility of vector memory operations with respect to LDS 6136 operations of other wavefronts in the same work-group. A ``s_waitcnt 6137 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 6138 vector memory operations between wavefronts of a work-group, but not between 6139 operations performed by the same wavefront. 6140* The vector memory operations are performed as wavefront wide operations and 6141 completion is reported to a wavefront in execution order. The exception is 6142 that ``flat_load/store/atomic`` instructions can report out of vector memory 6143 order if they access LDS memory, and out of LDS operation order if they access 6144 global memory. 6145* The vector memory operations access a single vector L1 cache shared by all 6146 SIMDs a CU. Therefore: 6147 6148 * No special action is required for coherence between the lanes of a single 6149 wavefront. 6150 6151 * No special action is required for coherence between wavefronts in the same 6152 work-group since they execute on the same CU. The exception is when in 6153 tgsplit execution mode as wavefronts of the same work-group can be in 6154 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in 6155 the following item. 6156 6157 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts 6158 executing in different work-groups as they may be executing on different 6159 CUs. 6160 6161* The scalar memory operations access a scalar L1 cache shared by all wavefronts 6162 on a group of CUs. The scalar and vector L1 caches are not coherent. However, 6163 scalar operations are used in a restricted way so do not impact the memory 6164 model. See :ref:`amdgpu-amdhsa-memory-spaces`. 6165* The vector and scalar memory operations use an L2 cache shared by all CUs on 6166 the same agent. 
6167 6168 * The L2 cache has independent channels to service disjoint ranges of virtual 6169 addresses. 6170 * Each CU has a separate request queue per channel. Therefore, the vector and 6171 scalar memory operations performed by wavefronts executing in different 6172 work-groups (which may be executing on different CUs), or the same 6173 work-group if executing in tgsplit mode, of an agent can be reordered 6174 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure 6175 synchronization between vector memory operations of different CUs. It 6176 ensures a previous vector memory operation has completed before executing a 6177 subsequent vector memory or LDS operation and so can be used to meet the 6178 requirements of acquire and release. 6179 * The L2 cache of one agent can be kept coherent with other agents by: 6180 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE 6181 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with 6182 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2. 6183 6184 * Any local memory cache lines will be automatically invalidated by writes 6185 from CUs associated with other L2 caches, or writes from the CPU, due to 6186 the cache probe caused by coherent requests. Coherent requests are caused 6187 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over 6188 XGMI, and by PCIe requests that are configured to be coherent requests. 6189 * XGMI accesses from the CPU to local memory may be cached on the CPU. 6190 Subsequent access from the GPU will automatically invalidate or writeback 6191 the CPU cache due to the L2 probe filter and and the PTE C-bit being set. 6192 * Since all work-groups on the same agent share the same L2, no L2 6193 invalidation or writeback is required for coherence. 6194 * To ensure coherence of local and remote memory writes of work-groups in 6195 different agents a ``buffer_wbl2`` is required. 
It will writeback dirty L2 6196 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC 6197 ()used for remote coarse grain memory). Note that MTYPE CC (used for local 6198 fine grain memory) causes write through to DRAM, and MTYPE UC (used for 6199 remote fine grain memory) bypasses the L2, so both will never result in 6200 dirty L2 cache lines. 6201 * To ensure coherence of local and remote memory reads of work-groups in 6202 different agents a ``buffer_invl2`` is required. It will invalidate L2 6203 cache lines with MTYPE NC (used for remote coarse grain memory). Note that 6204 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local 6205 coarse memory) cause local reads to be invalidated by remote writes with 6206 with the PTE C-bit so these cache lines are not invalidated. Note that 6207 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will 6208 never result in L2 cache lines that need to be invalidated. 6209 6210 * PCIe access from the GPU to the CPU memory is kept coherent by using the 6211 MTYPE UC (uncached) which bypasses the L2. 6212 6213Scalar memory operations are only used to access memory that is proven to not 6214change during the execution of the kernel dispatch. This includes constant 6215address space and global address space for program scope ``const`` variables. 6216Therefore, the kernel machine code does not have to maintain the scalar cache to 6217ensure it is coherent with the vector caches. The scalar and vector caches are 6218invalidated between kernel dispatches by CP since constant address space data 6219may change between kernel dispatch executions. See 6220:ref:`amdgpu-amdhsa-memory-spaces`. 6221 6222The one exception is if scalar writes are used to spill SGPR registers. In this 6223case the AMDGPU backend ensures the memory location used to spill is never 6224accessed by vector memory operations at the same time. 
If scalar writes are used then a ``s_dcache_wb`` is inserted before the
``s_endpgm`` and before a function return since the locations may be used
for vector memory instructions by a future wavefront that uses the same
scratch area, or a function call that creates a frame at the same address,
respectively. There is no need for a ``s_dcache_inv`` as all scalar writes
are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the
  L2 cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only
invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.

  ..
  .. table:: AMDHSA Memory Model Code Sequences GFX90A
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX90A
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6319 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 6320 **Monotonic Atomic** 6321 ------------------------------------------------------------------------------------ 6322 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 6323 - wavefront - generic 6324 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 6325 - generic glc=1 6326 6327 - If not TgSplit execution 6328 mode, omit glc=1. 6329 6330 load atomic monotonic - singlethread - local *If TgSplit execution mode, 6331 - wavefront local address space cannot 6332 - workgroup be used.* 6333 6334 1. ds_load 6335 load atomic monotonic - agent - global 1. buffer/global/flat_load 6336 - generic glc=1 6337 load atomic monotonic - system - global 1. buffer/global/flat_load 6338 - generic glc=1 6339 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 6340 - wavefront - generic 6341 - workgroup 6342 - agent 6343 store atomic monotonic - system - global 1. buffer/global/flat_store 6344 - generic 6345 store atomic monotonic - singlethread - local *If TgSplit execution mode, 6346 - wavefront local address space cannot 6347 - workgroup be used.* 6348 6349 1. ds_store 6350 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 6351 - wavefront - generic 6352 - workgroup 6353 - agent 6354 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic 6355 - generic 6356 atomicrmw monotonic - singlethread - local *If TgSplit execution mode, 6357 - wavefront local address space cannot 6358 - workgroup be used.* 6359 6360 1. ds_atomic 6361 **Acquire Atomic** 6362 ------------------------------------------------------------------------------------ 6363 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 6364 - wavefront - local 6365 - generic 6366 load atomic acquire - workgroup - global 1. buffer/global_load glc=1 6367 6368 - If not TgSplit execution 6369 mode, omit glc=1. 6370 6371 2. 
s_waitcnt vmcnt(0) 6372 6373 - If not TgSplit execution 6374 mode, omit. 6375 - Must happen before the 6376 following buffer_wbinvl1_vol. 6377 6378 3. buffer_wbinvl1_vol 6379 6380 - If not TgSplit execution 6381 mode, omit. 6382 - Must happen before 6383 any following 6384 global/generic 6385 load/load 6386 atomic/store/store 6387 atomic/atomicrmw. 6388 - Ensures that 6389 following 6390 loads will not see 6391 stale data. 6392 6393 load atomic acquire - workgroup - local *If TgSplit execution mode, 6394 local address space cannot 6395 be used.* 6396 6397 1. ds_load 6398 2. s_waitcnt lgkmcnt(0) 6399 6400 - If OpenCL, omit. 6401 - Must happen before 6402 any following 6403 global/generic 6404 load/load 6405 atomic/store/store 6406 atomic/atomicrmw. 6407 - Ensures any 6408 following global 6409 data read is no 6410 older than the local load 6411 atomic value being 6412 acquired. 6413 6414 load atomic acquire - workgroup - generic 1. flat_load glc=1 6415 6416 - If not TgSplit execution 6417 mode, omit glc=1. 6418 6419 2. s_waitcnt lgkm/vmcnt(0) 6420 6421 - Use lgkmcnt(0) if not 6422 TgSplit execution mode 6423 and vmcnt(0) if TgSplit 6424 execution mode. 6425 - If OpenCL, omit lgkmcnt(0). 6426 - Must happen before 6427 the following 6428 buffer_wbinvl1_vol and any 6429 following global/generic 6430 load/load 6431 atomic/store/store 6432 atomic/atomicrmw. 6433 - Ensures any 6434 following global 6435 data read is no 6436 older than a local load 6437 atomic value being 6438 acquired. 6439 6440 3. buffer_wbinvl1_vol 6441 6442 - If not TgSplit execution 6443 mode, omit. 6444 - Ensures that 6445 following 6446 loads will not see 6447 stale data. 6448 6449 load atomic acquire - agent - global 1. buffer/global_load 6450 glc=1 6451 2. s_waitcnt vmcnt(0) 6452 6453 - Must happen before 6454 following 6455 buffer_wbinvl1_vol. 6456 - Ensures the load 6457 has completed 6458 before invalidating 6459 the cache. 6460 6461 3. 
buffer_wbinvl1_vol 6462 6463 - Must happen before 6464 any following 6465 global/generic 6466 load/load 6467 atomic/atomicrmw. 6468 - Ensures that 6469 following 6470 loads will not see 6471 stale global data. 6472 6473 load atomic acquire - system - global 1. buffer/global/flat_load 6474 glc=1 6475 2. s_waitcnt vmcnt(0) 6476 6477 - Must happen before 6478 following buffer_invl2 and 6479 buffer_wbinvl1_vol. 6480 - Ensures the load 6481 has completed 6482 before invalidating 6483 the cache. 6484 6485 3. buffer_invl2; 6486 buffer_wbinvl1_vol 6487 6488 - Must happen before 6489 any following 6490 global/generic 6491 load/load 6492 atomic/atomicrmw. 6493 - Ensures that 6494 following 6495 loads will not see 6496 stale L1 global data, 6497 nor see stale L2 MTYPE 6498 NC global data. 6499 MTYPE RW and CC memory will 6500 never be stale in L2 due to 6501 the memory probes. 6502 6503 load atomic acquire - agent - generic 1. flat_load glc=1 6504 2. s_waitcnt vmcnt(0) & 6505 lgkmcnt(0) 6506 6507 - If TgSplit execution mode, 6508 omit lgkmcnt(0). 6509 - If OpenCL omit 6510 lgkmcnt(0). 6511 - Must happen before 6512 following 6513 buffer_wbinvl1_vol. 6514 - Ensures the flat_load 6515 has completed 6516 before invalidating 6517 the cache. 6518 6519 3. buffer_wbinvl1_vol 6520 6521 - Must happen before 6522 any following 6523 global/generic 6524 load/load 6525 atomic/atomicrmw. 6526 - Ensures that 6527 following loads 6528 will not see stale 6529 global data. 6530 6531 load atomic acquire - system - generic 1. flat_load glc=1 6532 2. s_waitcnt vmcnt(0) & 6533 lgkmcnt(0) 6534 6535 - If TgSplit execution mode, 6536 omit lgkmcnt(0). 6537 - If OpenCL omit 6538 lgkmcnt(0). 6539 - Must happen before 6540 following 6541 buffer_invl2 and 6542 buffer_wbinvl1_vol. 6543 - Ensures the flat_load 6544 has completed 6545 before invalidating 6546 the caches. 6547 6548 3. 
buffer_invl2; 6549 buffer_wbinvl1_vol 6550 6551 - Must happen before 6552 any following 6553 global/generic 6554 load/load 6555 atomic/atomicrmw. 6556 - Ensures that 6557 following 6558 loads will not see 6559 stale L1 global data, 6560 nor see stale L2 MTYPE 6561 NC global data. 6562 MTYPE RW and CC memory will 6563 never be stale in L2 due to 6564 the memory probes. 6565 6566 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic 6567 - wavefront - generic 6568 atomicrmw acquire - singlethread - local *If TgSplit execution mode, 6569 - wavefront local address space cannot 6570 be used.* 6571 6572 1. ds_atomic 6573 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 6574 2. s_waitcnt vmcnt(0) 6575 6576 - If not TgSplit execution 6577 mode, omit. 6578 - Must happen before the 6579 following buffer_wbinvl1_vol. 6580 - Ensures the atomicrmw 6581 has completed 6582 before invalidating 6583 the cache. 6584 6585 3. buffer_wbinvl1_vol 6586 6587 - If not TgSplit execution 6588 mode, omit. 6589 - Must happen before 6590 any following 6591 global/generic 6592 load/load 6593 atomic/atomicrmw. 6594 - Ensures that 6595 following loads 6596 will not see stale 6597 global data. 6598 6599 atomicrmw acquire - workgroup - local *If TgSplit execution mode, 6600 local address space cannot 6601 be used.* 6602 6603 1. ds_atomic 6604 2. s_waitcnt lgkmcnt(0) 6605 6606 - If OpenCL, omit. 6607 - Must happen before 6608 any following 6609 global/generic 6610 load/load 6611 atomic/store/store 6612 atomic/atomicrmw. 6613 - Ensures any 6614 following global 6615 data read is no 6616 older than the local 6617 atomicrmw value 6618 being acquired. 6619 6620 atomicrmw acquire - workgroup - generic 1. flat_atomic 6621 2. s_waitcnt lgkm/vmcnt(0) 6622 6623 - Use lgkmcnt(0) if not 6624 TgSplit execution mode 6625 and vmcnt(0) if TgSplit 6626 execution mode. 6627 - If OpenCL, omit lgkmcnt(0). 
6628 - Must happen before 6629 the following 6630 buffer_wbinvl1_vol and 6631 any following 6632 global/generic 6633 load/load 6634 atomic/store/store 6635 atomic/atomicrmw. 6636 - Ensures any 6637 following global 6638 data read is no 6639 older than a local 6640 atomicrmw value 6641 being acquired. 6642 6643 3. buffer_wbinvl1_vol 6644 6645 - If not TgSplit execution 6646 mode, omit. 6647 - Ensures that 6648 following 6649 loads will not see 6650 stale data. 6651 6652 atomicrmw acquire - agent - global 1. buffer/global_atomic 6653 2. s_waitcnt vmcnt(0) 6654 6655 - Must happen before 6656 following 6657 buffer_wbinvl1_vol. 6658 - Ensures the 6659 atomicrmw has 6660 completed before 6661 invalidating the 6662 cache. 6663 6664 3. buffer_wbinvl1_vol 6665 6666 - Must happen before 6667 any following 6668 global/generic 6669 load/load 6670 atomic/atomicrmw. 6671 - Ensures that 6672 following loads 6673 will not see stale 6674 global data. 6675 6676 atomicrmw acquire - system - global 1. buffer/global_atomic 6677 2. s_waitcnt vmcnt(0) 6678 6679 - Must happen before 6680 following buffer_invl2 and 6681 buffer_wbinvl1_vol. 6682 - Ensures the 6683 atomicrmw has 6684 completed before 6685 invalidating the 6686 caches. 6687 6688 3. buffer_invl2; 6689 buffer_wbinvl1_vol 6690 6691 - Must happen before 6692 any following 6693 global/generic 6694 load/load 6695 atomic/atomicrmw. 6696 - Ensures that 6697 following 6698 loads will not see 6699 stale L1 global data, 6700 nor see stale L2 MTYPE 6701 NC global data. 6702 MTYPE RW and CC memory will 6703 never be stale in L2 due to 6704 the memory probes. 6705 6706 atomicrmw acquire - agent - generic 1. flat_atomic 6707 2. s_waitcnt vmcnt(0) & 6708 lgkmcnt(0) 6709 6710 - If TgSplit execution mode, 6711 omit lgkmcnt(0). 6712 - If OpenCL, omit 6713 lgkmcnt(0). 6714 - Must happen before 6715 following 6716 buffer_wbinvl1_vol. 6717 - Ensures the 6718 atomicrmw has 6719 completed before 6720 invalidating the 6721 cache. 6722 6723 3. 
buffer_wbinvl1_vol 6724 6725 - Must happen before 6726 any following 6727 global/generic 6728 load/load 6729 atomic/atomicrmw. 6730 - Ensures that 6731 following loads 6732 will not see stale 6733 global data. 6734 6735 atomicrmw acquire - system - generic 1. flat_atomic 6736 2. s_waitcnt vmcnt(0) & 6737 lgkmcnt(0) 6738 6739 - If TgSplit execution mode, 6740 omit lgkmcnt(0). 6741 - If OpenCL, omit 6742 lgkmcnt(0). 6743 - Must happen before 6744 following 6745 buffer_invl2 and 6746 buffer_wbinvl1_vol. 6747 - Ensures the 6748 atomicrmw has 6749 completed before 6750 invalidating the 6751 caches. 6752 6753 3. buffer_invl2; 6754 buffer_wbinvl1_vol 6755 6756 - Must happen before 6757 any following 6758 global/generic 6759 load/load 6760 atomic/atomicrmw. 6761 - Ensures that 6762 following 6763 loads will not see 6764 stale L1 global data, 6765 nor see stale L2 MTYPE 6766 NC global data. 6767 MTYPE RW and CC memory will 6768 never be stale in L2 due to 6769 the memory probes. 6770 6771 fence acquire - singlethread *none* *none* 6772 - wavefront 6773 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 6774 6775 - Use lgkmcnt(0) if not 6776 TgSplit execution mode 6777 and vmcnt(0) if TgSplit 6778 execution mode. 6779 - If OpenCL and 6780 address space is 6781 not generic, omit 6782 lgkmcnt(0). 6783 - If OpenCL and 6784 address space is 6785 local, omit 6786 vmcnt(0). 6787 - However, since LLVM 6788 currently has no 6789 address space on 6790 the fence need to 6791 conservatively 6792 always generate. If 6793 fence had an 6794 address space then 6795 set to address 6796 space of OpenCL 6797 fence flag, or to 6798 generic if both 6799 local and global 6800 flags are 6801 specified. 6802 - s_waitcnt vmcnt(0) 6803 must happen after 6804 any preceding 6805 global/generic load 6806 atomic/ 6807 atomicrmw 6808 with an equal or 6809 wider sync scope 6810 and memory ordering 6811 stronger than 6812 unordered (this is 6813 termed the 6814 fence-paired-atomic). 
6815 - s_waitcnt lgkmcnt(0) 6816 must happen after 6817 any preceding 6818 local/generic load 6819 atomic/atomicrmw 6820 with an equal or 6821 wider sync scope 6822 and memory ordering 6823 stronger than 6824 unordered (this is 6825 termed the 6826 fence-paired-atomic). 6827 - Must happen before 6828 the following 6829 buffer_wbinvl1_vol and 6830 any following 6831 global/generic 6832 load/load 6833 atomic/store/store 6834 atomic/atomicrmw. 6835 - Ensures any 6836 following global 6837 data read is no 6838 older than the 6839 value read by the 6840 fence-paired-atomic. 6841 6842 2. buffer_wbinvl1_vol 6843 6844 - If not TgSplit execution 6845 mode, omit. 6846 - Ensures that 6847 following 6848 loads will not see 6849 stale data. 6850 6851 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 6852 vmcnt(0) 6853 6854 - If TgSplit execution mode, 6855 omit lgkmcnt(0). 6856 - If OpenCL and 6857 address space is 6858 not generic, omit 6859 lgkmcnt(0). 6860 - However, since LLVM 6861 currently has no 6862 address space on 6863 the fence need to 6864 conservatively 6865 always generate 6866 (see comment for 6867 previous fence). 6868 - Could be split into 6869 separate s_waitcnt 6870 vmcnt(0) and 6871 s_waitcnt 6872 lgkmcnt(0) to allow 6873 them to be 6874 independently moved 6875 according to the 6876 following rules. 6877 - s_waitcnt vmcnt(0) 6878 must happen after 6879 any preceding 6880 global/generic load 6881 atomic/atomicrmw 6882 with an equal or 6883 wider sync scope 6884 and memory ordering 6885 stronger than 6886 unordered (this is 6887 termed the 6888 fence-paired-atomic). 6889 - s_waitcnt lgkmcnt(0) 6890 must happen after 6891 any preceding 6892 local/generic load 6893 atomic/atomicrmw 6894 with an equal or 6895 wider sync scope 6896 and memory ordering 6897 stronger than 6898 unordered (this is 6899 termed the 6900 fence-paired-atomic). 6901 - Must happen before 6902 the following 6903 buffer_wbinvl1_vol. 
6904 - Ensures that the 6905 fence-paired atomic 6906 has completed 6907 before invalidating 6908 the 6909 cache. Therefore 6910 any following 6911 locations read must 6912 be no older than 6913 the value read by 6914 the 6915 fence-paired-atomic. 6916 6917 2. buffer_wbinvl1_vol 6918 6919 - Must happen before any 6920 following global/generic 6921 load/load 6922 atomic/store/store 6923 atomic/atomicrmw. 6924 - Ensures that 6925 following loads 6926 will not see stale 6927 global data. 6928 6929 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) & 6930 vmcnt(0) 6931 6932 - If TgSplit execution mode, 6933 omit lgkmcnt(0). 6934 - If OpenCL and 6935 address space is 6936 not generic, omit 6937 lgkmcnt(0). 6938 - However, since LLVM 6939 currently has no 6940 address space on 6941 the fence need to 6942 conservatively 6943 always generate 6944 (see comment for 6945 previous fence). 6946 - Could be split into 6947 separate s_waitcnt 6948 vmcnt(0) and 6949 s_waitcnt 6950 lgkmcnt(0) to allow 6951 them to be 6952 independently moved 6953 according to the 6954 following rules. 6955 - s_waitcnt vmcnt(0) 6956 must happen after 6957 any preceding 6958 global/generic load 6959 atomic/atomicrmw 6960 with an equal or 6961 wider sync scope 6962 and memory ordering 6963 stronger than 6964 unordered (this is 6965 termed the 6966 fence-paired-atomic). 6967 - s_waitcnt lgkmcnt(0) 6968 must happen after 6969 any preceding 6970 local/generic load 6971 atomic/atomicrmw 6972 with an equal or 6973 wider sync scope 6974 and memory ordering 6975 stronger than 6976 unordered (this is 6977 termed the 6978 fence-paired-atomic). 6979 - Must happen before 6980 the following buffer_invl2 and 6981 buffer_wbinvl1_vol. 6982 - Ensures that the 6983 fence-paired atomic 6984 has completed 6985 before invalidating 6986 the 6987 cache. Therefore 6988 any following 6989 locations read must 6990 be no older than 6991 the value read by 6992 the 6993 fence-paired-atomic. 6994 6995 2. 
buffer_invl2; 6996 buffer_wbinvl1_vol 6997 6998 - Must happen before any 6999 following global/generic 7000 load/load 7001 atomic/store/store 7002 atomic/atomicrmw. 7003 - Ensures that 7004 following 7005 loads will not see 7006 stale L1 global data, 7007 nor see stale L2 MTYPE 7008 NC global data. 7009 MTYPE RW and CC memory will 7010 never be stale in L2 due to 7011 the memory probes. 7012 **Release Atomic** 7013 ------------------------------------------------------------------------------------ 7014 store atomic release - singlethread - global 1. buffer/global/flat_store 7015 - wavefront - generic 7016 store atomic release - singlethread - local *If TgSplit execution mode, 7017 - wavefront local address space cannot 7018 be used.* 7019 7020 1. ds_store 7021 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7022 - generic 7023 - Use lgkmcnt(0) if not 7024 TgSplit execution mode 7025 and vmcnt(0) if TgSplit 7026 execution mode. 7027 - If OpenCL, omit lgkmcnt(0). 7028 - s_waitcnt vmcnt(0) 7029 must happen after 7030 any preceding 7031 global/generic load/store/ 7032 load atomic/store atomic/ 7033 atomicrmw. 7034 - s_waitcnt lgkmcnt(0) 7035 must happen after 7036 any preceding 7037 local/generic 7038 load/store/load 7039 atomic/store 7040 atomic/atomicrmw. 7041 - Must happen before 7042 the following 7043 store. 7044 - Ensures that all 7045 memory operations 7046 have 7047 completed before 7048 performing the 7049 store that is being 7050 released. 7051 7052 2. buffer/global/flat_store 7053 store atomic release - workgroup - local *If TgSplit execution mode, 7054 local address space cannot 7055 be used.* 7056 7057 1. ds_store 7058 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 7059 - generic vmcnt(0) 7060 7061 - If TgSplit execution mode, 7062 omit lgkmcnt(0). 7063 - If OpenCL and 7064 address space is 7065 not generic, omit 7066 lgkmcnt(0). 
7067 - Could be split into 7068 separate s_waitcnt 7069 vmcnt(0) and 7070 s_waitcnt 7071 lgkmcnt(0) to allow 7072 them to be 7073 independently moved 7074 according to the 7075 following rules. 7076 - s_waitcnt vmcnt(0) 7077 must happen after 7078 any preceding 7079 global/generic 7080 load/store/load 7081 atomic/store 7082 atomic/atomicrmw. 7083 - s_waitcnt lgkmcnt(0) 7084 must happen after 7085 any preceding 7086 local/generic 7087 load/store/load 7088 atomic/store 7089 atomic/atomicrmw. 7090 - Must happen before 7091 the following 7092 store. 7093 - Ensures that all 7094 memory operations 7095 to memory have 7096 completed before 7097 performing the 7098 store that is being 7099 released. 7100 7101 2. buffer/global/flat_store 7102 store atomic release - system - global 1. buffer_wbl2 7103 - generic 7104 - Must happen before 7105 following s_waitcnt. 7106 - Performs L2 writeback to 7107 ensure previous 7108 global/generic 7109 store/atomicrmw are 7110 visible at system scope. 7111 7112 2. s_waitcnt lgkmcnt(0) & 7113 vmcnt(0) 7114 7115 - If TgSplit execution mode, 7116 omit lgkmcnt(0). 7117 - If OpenCL and 7118 address space is 7119 not generic, omit 7120 lgkmcnt(0). 7121 - Could be split into 7122 separate s_waitcnt 7123 vmcnt(0) and 7124 s_waitcnt 7125 lgkmcnt(0) to allow 7126 them to be 7127 independently moved 7128 according to the 7129 following rules. 7130 - s_waitcnt vmcnt(0) 7131 must happen after any 7132 preceding 7133 global/generic 7134 load/store/load 7135 atomic/store 7136 atomic/atomicrmw. 7137 - s_waitcnt lgkmcnt(0) 7138 must happen after any 7139 preceding 7140 local/generic 7141 load/store/load 7142 atomic/store 7143 atomic/atomicrmw. 7144 - Must happen before 7145 the following 7146 store. 7147 - Ensures that all 7148 memory operations 7149 to memory and the L2 7150 writeback have 7151 completed before 7152 performing the 7153 store that is being 7154 released. 7155 7156 3. 
buffer/global/flat_store 7157 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic 7158 - wavefront - generic 7159 atomicrmw release - singlethread - local *If TgSplit execution mode, 7160 - wavefront local address space cannot 7161 be used.* 7162 7163 1. ds_atomic 7164 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7165 - generic 7166 - Use lgkmcnt(0) if not 7167 TgSplit execution mode 7168 and vmcnt(0) if TgSplit 7169 execution mode. 7170 - If OpenCL, omit 7171 lgkmcnt(0). 7172 - s_waitcnt vmcnt(0) 7173 must happen after 7174 any preceding 7175 global/generic load/store/ 7176 load atomic/store atomic/ 7177 atomicrmw. 7178 - s_waitcnt lgkmcnt(0) 7179 must happen after 7180 any preceding 7181 local/generic 7182 load/store/load 7183 atomic/store 7184 atomic/atomicrmw. 7185 - Must happen before 7186 the following 7187 atomicrmw. 7188 - Ensures that all 7189 memory operations 7190 have 7191 completed before 7192 performing the 7193 atomicrmw that is 7194 being released. 7195 7196 2. buffer/global/flat_atomic 7197 atomicrmw release - workgroup - local *If TgSplit execution mode, 7198 local address space cannot 7199 be used.* 7200 7201 1. ds_atomic 7202 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 7203 - generic vmcnt(0) 7204 7205 - If TgSplit execution mode, 7206 omit lgkmcnt(0). 7207 - If OpenCL, omit 7208 lgkmcnt(0). 7209 - Could be split into 7210 separate s_waitcnt 7211 vmcnt(0) and 7212 s_waitcnt 7213 lgkmcnt(0) to allow 7214 them to be 7215 independently moved 7216 according to the 7217 following rules. 7218 - s_waitcnt vmcnt(0) 7219 must happen after 7220 any preceding 7221 global/generic 7222 load/store/load 7223 atomic/store 7224 atomic/atomicrmw. 7225 - s_waitcnt lgkmcnt(0) 7226 must happen after 7227 any preceding 7228 local/generic 7229 load/store/load 7230 atomic/store 7231 atomic/atomicrmw. 7232 - Must happen before 7233 the following 7234 atomicrmw. 
                                                          - Ensures that all
                                                            memory operations
                                                            to global and local
                                                            have completed
                                                            before performing
                                                            the atomicrmw that
                                                            is being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - system       - global   1. buffer_wbl2
                               - generic
                                                          - Must happen before
                                                            following s_waitcnt.
                                                          - Performs L2 writeback to
                                                            ensure previous
                                                            global/generic
                                                            store/atomicrmw are
                                                            visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            to memory and the L2
                                                            writeback have
                                                            completed before
                                                            performing the
                                                            atomicrmw that is
                                                            being released.

                                                         3. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate.
If 7319 fence had an 7320 address space then 7321 set to address 7322 space of OpenCL 7323 fence flag, or to 7324 generic if both 7325 local and global 7326 flags are 7327 specified. 7328 - s_waitcnt vmcnt(0) 7329 must happen after 7330 any preceding 7331 global/generic 7332 load/store/ 7333 load atomic/store atomic/ 7334 atomicrmw. 7335 - s_waitcnt lgkmcnt(0) 7336 must happen after 7337 any preceding 7338 local/generic 7339 load/load 7340 atomic/store/store 7341 atomic/atomicrmw. 7342 - Must happen before 7343 any following store 7344 atomic/atomicrmw 7345 with an equal or 7346 wider sync scope 7347 and memory ordering 7348 stronger than 7349 unordered (this is 7350 termed the 7351 fence-paired-atomic). 7352 - Ensures that all 7353 memory operations 7354 have 7355 completed before 7356 performing the 7357 following 7358 fence-paired-atomic. 7359 7360 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 7361 vmcnt(0) 7362 7363 - If TgSplit execution mode, 7364 omit lgkmcnt(0). 7365 - If OpenCL and 7366 address space is 7367 not generic, omit 7368 lgkmcnt(0). 7369 - If OpenCL and 7370 address space is 7371 local, omit 7372 vmcnt(0). 7373 - However, since LLVM 7374 currently has no 7375 address space on 7376 the fence need to 7377 conservatively 7378 always generate. If 7379 fence had an 7380 address space then 7381 set to address 7382 space of OpenCL 7383 fence flag, or to 7384 generic if both 7385 local and global 7386 flags are 7387 specified. 7388 - Could be split into 7389 separate s_waitcnt 7390 vmcnt(0) and 7391 s_waitcnt 7392 lgkmcnt(0) to allow 7393 them to be 7394 independently moved 7395 according to the 7396 following rules. 7397 - s_waitcnt vmcnt(0) 7398 must happen after 7399 any preceding 7400 global/generic 7401 load/store/load 7402 atomic/store 7403 atomic/atomicrmw. 7404 - s_waitcnt lgkmcnt(0) 7405 must happen after 7406 any preceding 7407 local/generic 7408 load/store/load 7409 atomic/store 7410 atomic/atomicrmw. 
7411 - Must happen before 7412 any following store 7413 atomic/atomicrmw 7414 with an equal or 7415 wider sync scope 7416 and memory ordering 7417 stronger than 7418 unordered (this is 7419 termed the 7420 fence-paired-atomic). 7421 - Ensures that all 7422 memory operations 7423 have 7424 completed before 7425 performing the 7426 following 7427 fence-paired-atomic. 7428 7429 fence release - system *none* 1. buffer_wbl2 7430 7431 - If OpenCL and 7432 address space is 7433 local, omit. 7434 - Must happen before 7435 following s_waitcnt. 7436 - Performs L2 writeback to 7437 ensure previous 7438 global/generic 7439 store/atomicrmw are 7440 visible at system scope. 7441 7442 2. s_waitcnt lgkmcnt(0) & 7443 vmcnt(0) 7444 7445 - If TgSplit execution mode, 7446 omit lgkmcnt(0). 7447 - If OpenCL and 7448 address space is 7449 not generic, omit 7450 lgkmcnt(0). 7451 - If OpenCL and 7452 address space is 7453 local, omit 7454 vmcnt(0). 7455 - However, since LLVM 7456 currently has no 7457 address space on 7458 the fence need to 7459 conservatively 7460 always generate. If 7461 fence had an 7462 address space then 7463 set to address 7464 space of OpenCL 7465 fence flag, or to 7466 generic if both 7467 local and global 7468 flags are 7469 specified. 7470 - Could be split into 7471 separate s_waitcnt 7472 vmcnt(0) and 7473 s_waitcnt 7474 lgkmcnt(0) to allow 7475 them to be 7476 independently moved 7477 according to the 7478 following rules. 7479 - s_waitcnt vmcnt(0) 7480 must happen after 7481 any preceding 7482 global/generic 7483 load/store/load 7484 atomic/store 7485 atomic/atomicrmw. 7486 - s_waitcnt lgkmcnt(0) 7487 must happen after 7488 any preceding 7489 local/generic 7490 load/store/load 7491 atomic/store 7492 atomic/atomicrmw. 7493 - Must happen before 7494 any following store 7495 atomic/atomicrmw 7496 with an equal or 7497 wider sync scope 7498 and memory ordering 7499 stronger than 7500 unordered (this is 7501 termed the 7502 fence-paired-atomic). 
7503 - Ensures that all 7504 memory operations 7505 have 7506 completed before 7507 performing the 7508 following 7509 fence-paired-atomic. 7510 7511 **Acquire-Release Atomic** 7512 ------------------------------------------------------------------------------------ 7513 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic 7514 - wavefront - generic 7515 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode, 7516 - wavefront local address space cannot 7517 be used.* 7518 7519 1. ds_atomic 7520 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7521 7522 - Use lgkmcnt(0) if not 7523 TgSplit execution mode 7524 and vmcnt(0) if TgSplit 7525 execution mode. 7526 - If OpenCL, omit 7527 lgkmcnt(0). 7528 - Must happen after 7529 any preceding 7530 local/generic 7531 load/store/load 7532 atomic/store 7533 atomic/atomicrmw. 7534 - s_waitcnt vmcnt(0) 7535 must happen after 7536 any preceding 7537 global/generic load/store/ 7538 load atomic/store atomic/ 7539 atomicrmw. 7540 - s_waitcnt lgkmcnt(0) 7541 must happen after 7542 any preceding 7543 local/generic 7544 load/store/load 7545 atomic/store 7546 atomic/atomicrmw. 7547 - Must happen before 7548 the following 7549 atomicrmw. 7550 - Ensures that all 7551 memory operations 7552 have 7553 completed before 7554 performing the 7555 atomicrmw that is 7556 being released. 7557 7558 2. buffer/global_atomic 7559 3. s_waitcnt vmcnt(0) 7560 7561 - If not TgSplit execution 7562 mode, omit. 7563 - Must happen before 7564 the following 7565 buffer_wbinvl1_vol. 7566 - Ensures any 7567 following global 7568 data read is no 7569 older than the 7570 atomicrmw value 7571 being acquired. 7572 7573 4. buffer_wbinvl1_vol 7574 7575 - If not TgSplit execution 7576 mode, omit. 7577 - Ensures that 7578 following 7579 loads will not see 7580 stale data. 7581 7582 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode, 7583 local address space cannot 7584 be used.* 7585 7586 1. 
ds_atomic 7587 2. s_waitcnt lgkmcnt(0) 7588 7589 - If OpenCL, omit. 7590 - Must happen before 7591 any following 7592 global/generic 7593 load/load 7594 atomic/store/store 7595 atomic/atomicrmw. 7596 - Ensures any 7597 following global 7598 data read is no 7599 older than the local load 7600 atomic value being 7601 acquired. 7602 7603 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0) 7604 7605 - Use lgkmcnt(0) if not 7606 TgSplit execution mode 7607 and vmcnt(0) if TgSplit 7608 execution mode. 7609 - If OpenCL, omit 7610 lgkmcnt(0). 7611 - s_waitcnt vmcnt(0) 7612 must happen after 7613 any preceding 7614 global/generic load/store/ 7615 load atomic/store atomic/ 7616 atomicrmw. 7617 - s_waitcnt lgkmcnt(0) 7618 must happen after 7619 any preceding 7620 local/generic 7621 load/store/load 7622 atomic/store 7623 atomic/atomicrmw. 7624 - Must happen before 7625 the following 7626 atomicrmw. 7627 - Ensures that all 7628 memory operations 7629 have 7630 completed before 7631 performing the 7632 atomicrmw that is 7633 being released. 7634 7635 2. flat_atomic 7636 3. s_waitcnt lgkmcnt(0) & 7637 vmcnt(0) 7638 7639 - If not TgSplit execution 7640 mode, omit vmcnt(0). 7641 - If OpenCL, omit 7642 lgkmcnt(0). 7643 - Must happen before 7644 the following 7645 buffer_wbinvl1_vol and 7646 any following 7647 global/generic 7648 load/load 7649 atomic/store/store 7650 atomic/atomicrmw. 7651 - Ensures any 7652 following global 7653 data read is no 7654 older than a local load 7655 atomic value being 7656 acquired. 7657 7658 3. buffer_wbinvl1_vol 7659 7660 - If not TgSplit execution 7661 mode, omit. 7662 - Ensures that 7663 following 7664 loads will not see 7665 stale data. 7666 7667 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 7668 vmcnt(0) 7669 7670 - If TgSplit execution mode, 7671 omit lgkmcnt(0). 7672 - If OpenCL, omit 7673 lgkmcnt(0). 
7674 - Could be split into 7675 separate s_waitcnt 7676 vmcnt(0) and 7677 s_waitcnt 7678 lgkmcnt(0) to allow 7679 them to be 7680 independently moved 7681 according to the 7682 following rules. 7683 - s_waitcnt vmcnt(0) 7684 must happen after 7685 any preceding 7686 global/generic 7687 load/store/load 7688 atomic/store 7689 atomic/atomicrmw. 7690 - s_waitcnt lgkmcnt(0) 7691 must happen after 7692 any preceding 7693 local/generic 7694 load/store/load 7695 atomic/store 7696 atomic/atomicrmw. 7697 - Must happen before 7698 the following 7699 atomicrmw. 7700 - Ensures that all 7701 memory operations 7702 to global have 7703 completed before 7704 performing the 7705 atomicrmw that is 7706 being released. 7707 7708 2. buffer/global_atomic 7709 3. s_waitcnt vmcnt(0) 7710 7711 - Must happen before 7712 following 7713 buffer_wbinvl1_vol. 7714 - Ensures the 7715 atomicrmw has 7716 completed before 7717 invalidating the 7718 cache. 7719 7720 4. buffer_wbinvl1_vol 7721 7722 - Must happen before 7723 any following 7724 global/generic 7725 load/load 7726 atomic/atomicrmw. 7727 - Ensures that 7728 following loads 7729 will not see stale 7730 global data. 7731 7732 atomicrmw acq_rel - system - global 1. buffer_wbl2 7733 7734 - Must happen before 7735 following s_waitcnt. 7736 - Performs L2 writeback to 7737 ensure previous 7738 global/generic 7739 store/atomicrmw are 7740 visible at system scope. 7741 7742 2. s_waitcnt lgkmcnt(0) & 7743 vmcnt(0) 7744 7745 - If TgSplit execution mode, 7746 omit lgkmcnt(0). 7747 - If OpenCL, omit 7748 lgkmcnt(0). 7749 - Could be split into 7750 separate s_waitcnt 7751 vmcnt(0) and 7752 s_waitcnt 7753 lgkmcnt(0) to allow 7754 them to be 7755 independently moved 7756 according to the 7757 following rules. 7758 - s_waitcnt vmcnt(0) 7759 must happen after 7760 any preceding 7761 global/generic 7762 load/store/load 7763 atomic/store 7764 atomic/atomicrmw. 
7765 - s_waitcnt lgkmcnt(0) 7766 must happen after 7767 any preceding 7768 local/generic 7769 load/store/load 7770 atomic/store 7771 atomic/atomicrmw. 7772 - Must happen before 7773 the following 7774 atomicrmw. 7775 - Ensures that all 7776 memory operations 7777 to global and L2 writeback 7778 have completed before 7779 performing the 7780 atomicrmw that is 7781 being released. 7782 7783 3. buffer/global_atomic 7784 4. s_waitcnt vmcnt(0) 7785 7786 - Must happen before 7787 following buffer_invl2 and 7788 buffer_wbinvl1_vol. 7789 - Ensures the 7790 atomicrmw has 7791 completed before 7792 invalidating the 7793 caches. 7794 7795 5. buffer_invl2; 7796 buffer_wbinvl1_vol 7797 7798 - Must happen before 7799 any following 7800 global/generic 7801 load/load 7802 atomic/atomicrmw. 7803 - Ensures that 7804 following 7805 loads will not see 7806 stale L1 global data, 7807 nor see stale L2 MTYPE 7808 NC global data. 7809 MTYPE RW and CC memory will 7810 never be stale in L2 due to 7811 the memory probes. 7812 7813 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 7814 vmcnt(0) 7815 7816 - If TgSplit execution mode, 7817 omit lgkmcnt(0). 7818 - If OpenCL, omit 7819 lgkmcnt(0). 7820 - Could be split into 7821 separate s_waitcnt 7822 vmcnt(0) and 7823 s_waitcnt 7824 lgkmcnt(0) to allow 7825 them to be 7826 independently moved 7827 according to the 7828 following rules. 7829 - s_waitcnt vmcnt(0) 7830 must happen after 7831 any preceding 7832 global/generic 7833 load/store/load 7834 atomic/store 7835 atomic/atomicrmw. 7836 - s_waitcnt lgkmcnt(0) 7837 must happen after 7838 any preceding 7839 local/generic 7840 load/store/load 7841 atomic/store 7842 atomic/atomicrmw. 7843 - Must happen before 7844 the following 7845 atomicrmw. 7846 - Ensures that all 7847 memory operations 7848 to global have 7849 completed before 7850 performing the 7851 atomicrmw that is 7852 being released. 7853 7854 2. flat_atomic 7855 3. 
s_waitcnt vmcnt(0) & 7856 lgkmcnt(0) 7857 7858 - If TgSplit execution mode, 7859 omit lgkmcnt(0). 7860 - If OpenCL, omit 7861 lgkmcnt(0). 7862 - Must happen before 7863 following 7864 buffer_wbinvl1_vol. 7865 - Ensures the 7866 atomicrmw has 7867 completed before 7868 invalidating the 7869 cache. 7870 7871 4. buffer_wbinvl1_vol 7872 7873 - Must happen before 7874 any following 7875 global/generic 7876 load/load 7877 atomic/atomicrmw. 7878 - Ensures that 7879 following loads 7880 will not see stale 7881 global data. 7882 7883 atomicrmw acq_rel - system - generic 1. buffer_wbl2 7884 7885 - Must happen before 7886 following s_waitcnt. 7887 - Performs L2 writeback to 7888 ensure previous 7889 global/generic 7890 store/atomicrmw are 7891 visible at system scope. 7892 7893 2. s_waitcnt lgkmcnt(0) & 7894 vmcnt(0) 7895 7896 - If TgSplit execution mode, 7897 omit lgkmcnt(0). 7898 - If OpenCL, omit 7899 lgkmcnt(0). 7900 - Could be split into 7901 separate s_waitcnt 7902 vmcnt(0) and 7903 s_waitcnt 7904 lgkmcnt(0) to allow 7905 them to be 7906 independently moved 7907 according to the 7908 following rules. 7909 - s_waitcnt vmcnt(0) 7910 must happen after 7911 any preceding 7912 global/generic 7913 load/store/load 7914 atomic/store 7915 atomic/atomicrmw. 7916 - s_waitcnt lgkmcnt(0) 7917 must happen after 7918 any preceding 7919 local/generic 7920 load/store/load 7921 atomic/store 7922 atomic/atomicrmw. 7923 - Must happen before 7924 the following 7925 atomicrmw. 7926 - Ensures that all 7927 memory operations 7928 to global and L2 writeback 7929 have completed before 7930 performing the 7931 atomicrmw that is 7932 being released. 7933 7934 3. flat_atomic 7935 4. s_waitcnt vmcnt(0) & 7936 lgkmcnt(0) 7937 7938 - If TgSplit execution mode, 7939 omit lgkmcnt(0). 7940 - If OpenCL, omit 7941 lgkmcnt(0). 7942 - Must happen before 7943 following buffer_invl2 and 7944 buffer_wbinvl1_vol. 7945 - Ensures the 7946 atomicrmw has 7947 completed before 7948 invalidating the 7949 caches. 

                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that following
                                                             loads will not see stale
                                                             L1 global data, nor see
                                                             stale L2 MTYPE NC global
                                                             data. MTYPE RW and CC
                                                             memory will never be
                                                             stale in L2 due to the
                                                             memory probes.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL and address
                                                             space is not generic,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and address
                                                             space is local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no address
                                                             space on the fence need
                                                             to conservatively always
                                                             generate (see comment for
                                                             previous fence).
                                                           - s_waitcnt vmcnt(0) must
                                                             happen after any
                                                             preceding global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after any
                                                             preceding local/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that all memory
                                                             operations have completed
                                                             before performing any
                                                             following global memory
                                                             operations.
                                                           - Ensures that the
                                                             preceding local/generic
                                                             load atomic/atomicrmw
                                                             with an equal or wider
                                                             sync scope and memory
                                                             ordering stronger than
                                                             unordered (this is termed
                                                             the
                                                             acquire-fence-paired-atomic)
                                                             has completed before
                                                             following global memory
                                                             operations. This
                                                             satisfies the
                                                             requirements of acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have completed
                                                             before a following
                                                             local/generic store
                                                             atomic/atomicrmw with an
                                                             equal or wider sync scope
                                                             and memory ordering
                                                             stronger than unordered
                                                             (this is termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of release.
                                                           - Must happen before the
                                                             following
                                                             buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             acquire-fence-paired
                                                             atomic has completed
                                                             before invalidating the
                                                             cache. Therefore any
                                                             following locations read
                                                             must be no older than the
                                                             value read by the
                                                             acquire-fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that following
                                                             loads will not see stale
                                                             data.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution
                                                             mode, omit lgkmcnt(0).
                                                           - If OpenCL and address
                                                             space is not generic,
                                                             omit lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no address
                                                             space on the fence need
                                                             to conservatively always
                                                             generate (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow them
                                                             to be independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0) must
                                                             happen after any
                                                             preceding global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after any
                                                             preceding local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before the
                                                             following
                                                             buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             preceding
                                                             global/local/generic load
                                                             atomic/atomicrmw with an
                                                             equal or wider sync scope
                                                             and memory ordering
                                                             stronger than unordered
                                                             (this is termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed before
                                                             invalidating the cache.
                                                             This satisfies the
                                                             requirements of acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have completed
                                                             before a following
                                                             global/local/generic
                                                             store atomic/atomicrmw
                                                             with an equal or wider
                                                             sync scope and memory
                                                             ordering stronger than
                                                             unordered (this is termed
                                                             the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of release.

                                                         2. buffer_wbinvl1_vol

                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that following
                                                             loads will not see stale
                                                             global data. This
                                                             satisfies the
                                                             requirements of acquire.

     fence        acq_rel      - system       *none*     1. buffer_wbl2

                                                           - If OpenCL and address
                                                             space is local, omit.
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution
                                                             mode, omit lgkmcnt(0).
                                                           - If OpenCL and address
                                                             space is not generic,
                                                             omit lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no address
                                                             space on the fence need
                                                             to conservatively always
                                                             generate (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow them
                                                             to be independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0) must
                                                             happen after any
                                                             preceding global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after any
                                                             preceding local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before the
                                                             following buffer_invl2
                                                             and buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             preceding
                                                             global/local/generic load
                                                             atomic/atomicrmw with an
                                                             equal or wider sync scope
                                                             and memory ordering
                                                             stronger than unordered
                                                             (this is termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed before
                                                             invalidating the cache.
                                                             This satisfies the
                                                             requirements of acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have completed
                                                             before a following
                                                             global/local/generic
                                                             store atomic/atomicrmw
                                                             with an equal or wider
                                                             sync scope and memory
                                                             ordering stronger than
                                                             unordered (this is termed
                                                             the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of release.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that following
                                                             loads will not see stale
                                                             L1 global data, nor see
                                                             stale L2 MTYPE NC global
                                                             data. MTYPE RW and CC
                                                             memory will never be
                                                             stale in L2 due to the
                                                             memory probes.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             preceding local/generic
                                                             load atomic/store
                                                             atomic/atomicrmw with
                                                             memory ordering of
                                                             seq_cst and with equal or
                                                             wider sync scope. (Note
                                                             that seq_cst fences have
                                                             their own s_waitcnt
                                                             lgkmcnt(0) and so do not
                                                             need to be considered.)
                                                           - s_waitcnt vmcnt(0) must
                                                             happen after preceding
                                                             global/generic load
                                                             atomic/store
                                                             atomic/atomicrmw with
                                                             memory ordering of
                                                             seq_cst and with equal or
                                                             wider sync scope. (Note
                                                             that seq_cst fences have
                                                             their own s_waitcnt
                                                             vmcnt(0) and so do not
                                                             need to be considered.)
                                                           - Ensures any preceding
                                                             sequential consistent
                                                             global/local memory
                                                             instructions have
                                                             completed before
                                                             executing this
                                                             sequentially consistent
                                                             instruction. This
                                                             prevents reordering a
                                                             seq_cst store followed by
                                                             a seq_cst load. (Note
                                                             that seq_cst is stronger
                                                             than acquire/release as
                                                             the reordering of load
                                                             acquire followed by a
                                                             store release is
                                                             prevented by the
                                                             s_waitcnt of the release,
                                                             but there is nothing
                                                             preventing a store
                                                             release followed by load
                                                             acquire from completing
                                                             out of order. The
                                                             s_waitcnt could be placed
                                                             after seq_store or before
                                                             the seq_load. We choose
                                                             the load to make the
                                                             s_waitcnt be as late as
                                                             possible so that the
                                                             store may have already
                                                             completed.)

                                                         2. *Following instructions
                                                            same as corresponding
                                                            load atomic acquire,
                                                            except must generate all
                                                            instructions even for
                                                            OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                           - If TgSplit execution
                                                             mode, omit lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow them
                                                             to be independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             preceding global/generic
                                                             load atomic/store
                                                             atomic/atomicrmw with
                                                             memory ordering of
                                                             seq_cst and with equal or
                                                             wider sync scope. (Note
                                                             that seq_cst fences have
                                                             their own s_waitcnt
                                                             lgkmcnt(0) and so do not
                                                             need to be considered.)
                                                           - s_waitcnt vmcnt(0) must
                                                             happen after preceding
                                                             global/generic load
                                                             atomic/store
                                                             atomic/atomicrmw with
                                                             memory ordering of
                                                             seq_cst and with equal or
                                                             wider sync scope. (Note
                                                             that seq_cst fences have
                                                             their own s_waitcnt
                                                             vmcnt(0) and so do not
                                                             need to be considered.)
                                                           - Ensures any preceding
                                                             sequential consistent
                                                             global memory
                                                             instructions have
                                                             completed before
                                                             executing this
                                                             sequentially consistent
                                                             instruction. This
                                                             prevents reordering a
                                                             seq_cst store followed by
                                                             a seq_cst load. (Note
                                                             that seq_cst is stronger
                                                             than acquire/release as
                                                             the reordering of load
                                                             acquire followed by a
                                                             store release is
                                                             prevented by the
                                                             s_waitcnt of the release,
                                                             but there is nothing
                                                             preventing a store
                                                             release followed by load
                                                             acquire from completing
                                                             out of order. The
                                                             s_waitcnt could be placed
                                                             after seq_store or before
                                                             the seq_load. We choose
                                                             the load to make the
                                                             s_waitcnt be as late as
                                                             possible so that the
                                                             store may have already
                                                             completed.)

                                                         2. *Following instructions
                                                            same as corresponding
                                                            load atomic acquire,
                                                            except must generate all
                                                            instructions even for
                                                            OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10:

Memory Model GFX10
++++++++++++++++++

For GFX10:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU.
  In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP.
  The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other.
  A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 a memory attached last level (MALL) cache exists for GPU memory.
  The MALL cache is fully coherent with GPU memory and has no impact on system
  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively.
There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and the vscnt reports the completion of
store and atomic without return in order. See ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0 which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.
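
The acquire and release orderings that these cache-management sequences
implement follow the usual LLVM/C++ memory-model pairing rules: a release
store makes all prior memory operations visible to an acquire load that reads
the stored value. As a purely host-side sketch in standard C++ (not AMDGPU
machine code; the function name is illustrative), the message-passing pattern
behind the "release-fence-paired-atomic"/acquire pairing looks like:

.. code:: c++

  #include <atomic>
  #include <cassert>
  #include <thread>

  // One producer publishes a plain value and then performs a release store;
  // one consumer spins with an acquire load before reading the value.
  int run_message_passing() {
    std::atomic<int> flag{0};
    int payload = 0;  // ordinary (non-atomic) data, like a plain global store
    std::thread producer([&] {
      payload = 42;                              // ordinary store
      flag.store(1, std::memory_order_release);  // release: prior stores visible
    });
    std::thread consumer([&] {
      while (flag.load(std::memory_order_acquire) != 1) {
      }                                          // acquire: pairs with the release
    });
    producer.join();
    consumer.join();
    return payload;  // acquire/release ordering guarantees the consumer saw 42
  }

  int main() {
    assert(run_message_passing() == 42);
    return 0;
  }

On a GPU target the release side corresponds to waiting on outstanding memory
operations before the atomic, and the acquire side to invalidating stale cache
lines after it, as the code-sequence tables in this section detail.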
8629 8630See ``WGP_MODE`` field in 8631:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and 8632:ref:`amdgpu-target-features`. 8633 8634The code sequences used to implement the memory model for GFX10 are defined in 8635table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`. 8636 8637 .. table:: AMDHSA Memory Model Code Sequences GFX10 8638 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table 8639 8640 ============ ============ ============== ========== ================================ 8641 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 8642 Ordering Sync Scope Address GFX10 8643 Space 8644 ============ ============ ============== ========== ================================ 8645 **Non-Atomic** 8646 ------------------------------------------------------------------------------------ 8647 load *none* *none* - global - !volatile & !nontemporal 8648 - generic 8649 - private 1. buffer/global/flat_load 8650 - constant 8651 - !volatile & nontemporal 8652 8653 1. buffer/global/flat_load 8654 slc=1 8655 8656 - volatile 8657 8658 1. buffer/global/flat_load 8659 glc=1 dlc=1 8660 2. s_waitcnt vmcnt(0) 8661 8662 - Must happen before 8663 any following volatile 8664 global/generic 8665 load/store. 8666 - Ensures that 8667 volatile 8668 operations to 8669 different 8670 addresses will not 8671 be reordered by 8672 hardware. 8673 8674 load *none* *none* - local 1. ds_load 8675 store *none* *none* - global - !volatile & !nontemporal 8676 - generic 8677 - private 1. buffer/global/flat_store 8678 - constant 8679 - !volatile & nontemporal 8680 8681 1. buffer/global/flat_store 8682 glc=1 slc=1 8683 8684 - volatile 8685 8686 1. buffer/global/flat_store 8687 2. s_waitcnt vscnt(0) 8688 8689 - Must happen before 8690 any following volatile 8691 global/generic 8692 load/store. 8693 - Ensures that 8694 volatile 8695 operations to 8696 different 8697 addresses will not 8698 be reordered by 8699 hardware. 8700 8701 store *none* *none* - local 1. 
ds_store 8702 **Unordered Atomic** 8703 ------------------------------------------------------------------------------------ 8704 load atomic unordered *any* *any* *Same as non-atomic*. 8705 store atomic unordered *any* *any* *Same as non-atomic*. 8706 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 8707 **Monotonic Atomic** 8708 ------------------------------------------------------------------------------------ 8709 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 8710 - wavefront - generic 8711 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 8712 - generic glc=1 8713 8714 - If CU wavefront execution 8715 mode, omit glc=1. 8716 8717 load atomic monotonic - singlethread - local 1. ds_load 8718 - wavefront 8719 - workgroup 8720 load atomic monotonic - agent - global 1. buffer/global/flat_load 8721 - system - generic glc=1 dlc=1 8722 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 8723 - wavefront - generic 8724 - workgroup 8725 - agent 8726 - system 8727 store atomic monotonic - singlethread - local 1. ds_store 8728 - wavefront 8729 - workgroup 8730 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 8731 - wavefront - generic 8732 - workgroup 8733 - agent 8734 - system 8735 atomicrmw monotonic - singlethread - local 1. ds_atomic 8736 - wavefront 8737 - workgroup 8738 **Acquire Atomic** 8739 ------------------------------------------------------------------------------------ 8740 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 8741 - wavefront - local 8742 - generic 8743 load atomic acquire - workgroup - global 1. buffer/global_load glc=1 8744 8745 - If CU wavefront execution 8746 mode, omit glc=1. 8747 8748 2. s_waitcnt vmcnt(0) 8749 8750 - If CU wavefront execution 8751 mode, omit. 
8752 - Must happen before 8753 the following buffer_gl0_inv 8754 and before any following 8755 global/generic 8756 load/load 8757 atomic/store/store 8758 atomic/atomicrmw. 8759 8760 3. buffer_gl0_inv 8761 8762 - If CU wavefront execution 8763 mode, omit. 8764 - Ensures that 8765 following 8766 loads will not see 8767 stale data. 8768 8769 load atomic acquire - workgroup - local 1. ds_load 8770 2. s_waitcnt lgkmcnt(0) 8771 8772 - If OpenCL, omit. 8773 - Must happen before 8774 the following buffer_gl0_inv 8775 and before any following 8776 global/generic load/load 8777 atomic/store/store 8778 atomic/atomicrmw. 8779 - Ensures any 8780 following global 8781 data read is no 8782 older than the local load 8783 atomic value being 8784 acquired. 8785 8786 3. buffer_gl0_inv 8787 8788 - If CU wavefront execution 8789 mode, omit. 8790 - If OpenCL, omit. 8791 - Ensures that 8792 following 8793 loads will not see 8794 stale data. 8795 8796 load atomic acquire - workgroup - generic 1. flat_load glc=1 8797 8798 - If CU wavefront execution 8799 mode, omit glc=1. 8800 8801 2. s_waitcnt lgkmcnt(0) & 8802 vmcnt(0) 8803 8804 - If CU wavefront execution 8805 mode, omit vmcnt(0). 8806 - If OpenCL, omit 8807 lgkmcnt(0). 8808 - Must happen before 8809 the following 8810 buffer_gl0_inv and any 8811 following global/generic 8812 load/load 8813 atomic/store/store 8814 atomic/atomicrmw. 8815 - Ensures any 8816 following global 8817 data read is no 8818 older than a local load 8819 atomic value being 8820 acquired. 8821 8822 3. buffer_gl0_inv 8823 8824 - If CU wavefront execution 8825 mode, omit. 8826 - Ensures that 8827 following 8828 loads will not see 8829 stale data. 8830 8831 load atomic acquire - agent - global 1. buffer/global_load 8832 - system glc=1 dlc=1 8833 2. s_waitcnt vmcnt(0) 8834 8835 - Must happen before 8836 following 8837 buffer_gl*_inv. 8838 - Ensures the load 8839 has completed 8840 before invalidating 8841 the caches. 8842 8843 3. 
                                                            buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_gl*_inv.
                                                          - Ensures the flat_load
                                                            has completed
                                                            before invalidating
                                                            the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vm/vscnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Use vmcnt(0) if atomic with
                                                            return and vscnt(0) if
                                                            atomic with no-return.
                                                          - Must happen before
                                                            the following buffer_gl0_inv
                                                            and before any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before
                                                            the following
                                                            buffer_gl0_inv.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the local
                                                            atomicrmw value
                                                            being acquired.

                                                         3. buffer_gl0_inv

                                                          - If OpenCL, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     atomicrmw    acquire      - workgroup    - generic  1.
flat_atomic 8934 2. s_waitcnt lgkmcnt(0) & 8935 vm/vscnt(0) 8936 8937 - If CU wavefront execution 8938 mode, omit vm/vscnt(0). 8939 - If OpenCL, omit lgkmcnt(0). 8940 - Use vmcnt(0) if atomic with 8941 return and vscnt(0) if 8942 atomic with no-return. 8943 - Must happen before 8944 the following 8945 buffer_gl0_inv. 8946 - Ensures any 8947 following global 8948 data read is no 8949 older than a local 8950 atomicrmw value 8951 being acquired. 8952 8953 3. buffer_gl0_inv 8954 8955 - If CU wavefront execution 8956 mode, omit. 8957 - Ensures that 8958 following 8959 loads will not see 8960 stale data. 8961 8962 atomicrmw acquire - agent - global 1. buffer/global_atomic 8963 - system 2. s_waitcnt vm/vscnt(0) 8964 8965 - Use vmcnt(0) if atomic with 8966 return and vscnt(0) if 8967 atomic with no-return. 8968 - Must happen before 8969 following 8970 buffer_gl*_inv. 8971 - Ensures the 8972 atomicrmw has 8973 completed before 8974 invalidating the 8975 caches. 8976 8977 3. buffer_gl0_inv; 8978 buffer_gl1_inv 8979 8980 - Must happen before 8981 any following 8982 global/generic 8983 load/load 8984 atomic/atomicrmw. 8985 - Ensures that 8986 following loads 8987 will not see stale 8988 global data. 8989 8990 atomicrmw acquire - agent - generic 1. flat_atomic 8991 - system 2. s_waitcnt vm/vscnt(0) & 8992 lgkmcnt(0) 8993 8994 - If OpenCL, omit 8995 lgkmcnt(0). 8996 - Use vmcnt(0) if atomic with 8997 return and vscnt(0) if 8998 atomic with no-return. 8999 - Must happen before 9000 following 9001 buffer_gl*_inv. 9002 - Ensures the 9003 atomicrmw has 9004 completed before 9005 invalidating the 9006 caches. 9007 9008 3. buffer_gl0_inv; 9009 buffer_gl1_inv 9010 9011 - Must happen before 9012 any following 9013 global/generic 9014 load/load 9015 atomic/atomicrmw. 9016 - Ensures that 9017 following loads 9018 will not see stale 9019 global data. 9020 9021 fence acquire - singlethread *none* *none* 9022 - wavefront 9023 fence acquire - workgroup *none* 1. 
                                                            s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit vmcnt(0) and
                                                            vscnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            local, omit
                                                            vmcnt(0) and vscnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence, need to
                                                            conservatively
                                                            always generate. If
                                                            fence had an
                                                            address space then
                                                            set to address
                                                            space of OpenCL
                                                            fence flag, or to
                                                            generic if both
                                                            local and global
                                                            flags are
                                                            specified.
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0), s_waitcnt
                                                            vscnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load
                                                            atomic/
                                                            atomicrmw-with-return-value
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt vscnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            atomicrmw-no-return-value
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Must happen before
                                                            the following
                                                            buffer_gl0_inv.
                                                          - Ensures that the
                                                            fence-paired atomic
                                                            has completed
                                                            before invalidating
                                                            the
                                                            cache. Therefore
                                                            any following
                                                            locations read must
                                                            be no older than
                                                            the value read by
                                                            the
                                                            fence-paired-atomic.

                                                         2.
buffer_gl0_inv 9115 9116 - If CU wavefront execution 9117 mode, omit. 9118 - Ensures that 9119 following 9120 loads will not see 9121 stale data. 9122 9123 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 9124 - system vmcnt(0) & vscnt(0) 9125 9126 - If OpenCL and 9127 address space is 9128 not generic, omit 9129 lgkmcnt(0). 9130 - If OpenCL and 9131 address space is 9132 local, omit 9133 vmcnt(0) and vscnt(0). 9134 - However, since LLVM 9135 currently has no 9136 address space on 9137 the fence need to 9138 conservatively 9139 always generate 9140 (see comment for 9141 previous fence). 9142 - Could be split into 9143 separate s_waitcnt 9144 vmcnt(0), s_waitcnt 9145 vscnt(0) and s_waitcnt 9146 lgkmcnt(0) to allow 9147 them to be 9148 independently moved 9149 according to the 9150 following rules. 9151 - s_waitcnt vmcnt(0) 9152 must happen after 9153 any preceding 9154 global/generic load 9155 atomic/ 9156 atomicrmw-with-return-value 9157 with an equal or 9158 wider sync scope 9159 and memory ordering 9160 stronger than 9161 unordered (this is 9162 termed the 9163 fence-paired-atomic). 9164 - s_waitcnt vscnt(0) 9165 must happen after 9166 any preceding 9167 global/generic 9168 atomicrmw-no-return-value 9169 with an equal or 9170 wider sync scope 9171 and memory ordering 9172 stronger than 9173 unordered (this is 9174 termed the 9175 fence-paired-atomic). 9176 - s_waitcnt lgkmcnt(0) 9177 must happen after 9178 any preceding 9179 local/generic load 9180 atomic/atomicrmw 9181 with an equal or 9182 wider sync scope 9183 and memory ordering 9184 stronger than 9185 unordered (this is 9186 termed the 9187 fence-paired-atomic). 9188 - Must happen before 9189 the following 9190 buffer_gl*_inv. 9191 - Ensures that the 9192 fence-paired atomic 9193 has completed 9194 before invalidating 9195 the 9196 caches. Therefore 9197 any following 9198 locations read must 9199 be no older than 9200 the value read by 9201 the 9202 fence-paired-atomic. 9203 9204 2. 
buffer_gl0_inv; 9205 buffer_gl1_inv 9206 9207 - Must happen before any 9208 following global/generic 9209 load/load 9210 atomic/store/store 9211 atomic/atomicrmw. 9212 - Ensures that 9213 following loads 9214 will not see stale 9215 global data. 9216 9217 **Release Atomic** 9218 ------------------------------------------------------------------------------------ 9219 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 9220 - wavefront - local 9221 - generic 9222 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 9223 - generic vmcnt(0) & vscnt(0) 9224 9225 - If CU wavefront execution 9226 mode, omit vmcnt(0) and 9227 vscnt(0). 9228 - If OpenCL, omit 9229 lgkmcnt(0). 9230 - Could be split into 9231 separate s_waitcnt 9232 vmcnt(0), s_waitcnt 9233 vscnt(0) and s_waitcnt 9234 lgkmcnt(0) to allow 9235 them to be 9236 independently moved 9237 according to the 9238 following rules. 9239 - s_waitcnt vmcnt(0) 9240 must happen after 9241 any preceding 9242 global/generic load/load 9243 atomic/ 9244 atomicrmw-with-return-value. 9245 - s_waitcnt vscnt(0) 9246 must happen after 9247 any preceding 9248 global/generic 9249 store/store 9250 atomic/ 9251 atomicrmw-no-return-value. 9252 - s_waitcnt lgkmcnt(0) 9253 must happen after 9254 any preceding 9255 local/generic 9256 load/store/load 9257 atomic/store 9258 atomic/atomicrmw. 9259 - Must happen before 9260 the following 9261 store. 9262 - Ensures that all 9263 memory operations 9264 have 9265 completed before 9266 performing the 9267 store that is being 9268 released. 9269 9270 2. buffer/global/flat_store 9271 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 9272 9273 - If CU wavefront execution 9274 mode, omit. 9275 - If OpenCL, omit. 9276 - Could be split into 9277 separate s_waitcnt 9278 vmcnt(0) and s_waitcnt 9279 vscnt(0) to allow 9280 them to be 9281 independently moved 9282 according to the 9283 following rules. 
9284 - s_waitcnt vmcnt(0) 9285 must happen after 9286 any preceding 9287 global/generic load/load 9288 atomic/ 9289 atomicrmw-with-return-value. 9290 - s_waitcnt vscnt(0) 9291 must happen after 9292 any preceding 9293 global/generic 9294 store/store atomic/ 9295 atomicrmw-no-return-value. 9296 - Must happen before 9297 the following 9298 store. 9299 - Ensures that all 9300 global memory 9301 operations have 9302 completed before 9303 performing the 9304 store that is being 9305 released. 9306 9307 2. ds_store 9308 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 9309 - system - generic vmcnt(0) & vscnt(0) 9310 9311 - If OpenCL and 9312 address space is 9313 not generic, omit 9314 lgkmcnt(0). 9315 - Could be split into 9316 separate s_waitcnt 9317 vmcnt(0), s_waitcnt vscnt(0) 9318 and s_waitcnt 9319 lgkmcnt(0) to allow 9320 them to be 9321 independently moved 9322 according to the 9323 following rules. 9324 - s_waitcnt vmcnt(0) 9325 must happen after 9326 any preceding 9327 global/generic 9328 load/load 9329 atomic/ 9330 atomicrmw-with-return-value. 9331 - s_waitcnt vscnt(0) 9332 must happen after 9333 any preceding 9334 global/generic 9335 store/store atomic/ 9336 atomicrmw-no-return-value. 9337 - s_waitcnt lgkmcnt(0) 9338 must happen after 9339 any preceding 9340 local/generic 9341 load/store/load 9342 atomic/store 9343 atomic/atomicrmw. 9344 - Must happen before 9345 the following 9346 store. 9347 - Ensures that all 9348 memory operations 9349 have 9350 completed before 9351 performing the 9352 store that is being 9353 released. 9354 9355 2. buffer/global/flat_store 9356 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 9357 - wavefront - local 9358 - generic 9359 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 9360 - generic vmcnt(0) & vscnt(0) 9361 9362 - If CU wavefront execution 9363 mode, omit vmcnt(0) and 9364 vscnt(0). 9365 - If OpenCL, omit lgkmcnt(0). 
9366 - Could be split into 9367 separate s_waitcnt 9368 vmcnt(0), s_waitcnt 9369 vscnt(0) and s_waitcnt 9370 lgkmcnt(0) to allow 9371 them to be 9372 independently moved 9373 according to the 9374 following rules. 9375 - s_waitcnt vmcnt(0) 9376 must happen after 9377 any preceding 9378 global/generic load/load 9379 atomic/ 9380 atomicrmw-with-return-value. 9381 - s_waitcnt vscnt(0) 9382 must happen after 9383 any preceding 9384 global/generic 9385 store/store 9386 atomic/ 9387 atomicrmw-no-return-value. 9388 - s_waitcnt lgkmcnt(0) 9389 must happen after 9390 any preceding 9391 local/generic 9392 load/store/load 9393 atomic/store 9394 atomic/atomicrmw. 9395 - Must happen before 9396 the following 9397 atomicrmw. 9398 - Ensures that all 9399 memory operations 9400 have 9401 completed before 9402 performing the 9403 atomicrmw that is 9404 being released. 9405 9406 2. buffer/global/flat_atomic 9407 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 9408 9409 - If CU wavefront execution 9410 mode, omit. 9411 - If OpenCL, omit. 9412 - Could be split into 9413 separate s_waitcnt 9414 vmcnt(0) and s_waitcnt 9415 vscnt(0) to allow 9416 them to be 9417 independently moved 9418 according to the 9419 following rules. 9420 - s_waitcnt vmcnt(0) 9421 must happen after 9422 any preceding 9423 global/generic load/load 9424 atomic/ 9425 atomicrmw-with-return-value. 9426 - s_waitcnt vscnt(0) 9427 must happen after 9428 any preceding 9429 global/generic 9430 store/store atomic/ 9431 atomicrmw-no-return-value. 9432 - Must happen before 9433 the following 9434 store. 9435 - Ensures that all 9436 global memory 9437 operations have 9438 completed before 9439 performing the 9440 store that is being 9441 released. 9442 9443 2. ds_atomic 9444 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 9445 - system - generic vmcnt(0) & vscnt(0) 9446 9447 - If OpenCL, omit 9448 lgkmcnt(0). 
9449 - Could be split into 9450 separate s_waitcnt 9451 vmcnt(0), s_waitcnt 9452 vscnt(0) and s_waitcnt 9453 lgkmcnt(0) to allow 9454 them to be 9455 independently moved 9456 according to the 9457 following rules. 9458 - s_waitcnt vmcnt(0) 9459 must happen after 9460 any preceding 9461 global/generic 9462 load/load atomic/ 9463 atomicrmw-with-return-value. 9464 - s_waitcnt vscnt(0) 9465 must happen after 9466 any preceding 9467 global/generic 9468 store/store atomic/ 9469 atomicrmw-no-return-value. 9470 - s_waitcnt lgkmcnt(0) 9471 must happen after 9472 any preceding 9473 local/generic 9474 load/store/load 9475 atomic/store 9476 atomic/atomicrmw. 9477 - Must happen before 9478 the following 9479 atomicrmw. 9480 - Ensures that all 9481 memory operations 9482 to global and local 9483 have completed 9484 before performing 9485 the atomicrmw that 9486 is being released. 9487 9488 2. buffer/global/flat_atomic 9489 fence release - singlethread *none* *none* 9490 - wavefront 9491 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 9492 vmcnt(0) & vscnt(0) 9493 9494 - If CU wavefront execution 9495 mode, omit vmcnt(0) and 9496 vscnt(0). 9497 - If OpenCL and 9498 address space is 9499 not generic, omit 9500 lgkmcnt(0). 9501 - If OpenCL and 9502 address space is 9503 local, omit 9504 vmcnt(0) and vscnt(0). 9505 - However, since LLVM 9506 currently has no 9507 address space on 9508 the fence need to 9509 conservatively 9510 always generate. If 9511 fence had an 9512 address space then 9513 set to address 9514 space of OpenCL 9515 fence flag, or to 9516 generic if both 9517 local and global 9518 flags are 9519 specified. 9520 - Could be split into 9521 separate s_waitcnt 9522 vmcnt(0), s_waitcnt 9523 vscnt(0) and s_waitcnt 9524 lgkmcnt(0) to allow 9525 them to be 9526 independently moved 9527 according to the 9528 following rules. 
9529 - s_waitcnt vmcnt(0) 9530 must happen after 9531 any preceding 9532 global/generic 9533 load/load 9534 atomic/ 9535 atomicrmw-with-return-value. 9536 - s_waitcnt vscnt(0) 9537 must happen after 9538 any preceding 9539 global/generic 9540 store/store atomic/ 9541 atomicrmw-no-return-value. 9542 - s_waitcnt lgkmcnt(0) 9543 must happen after 9544 any preceding 9545 local/generic 9546 load/store/load 9547 atomic/store atomic/ 9548 atomicrmw. 9549 - Must happen before 9550 any following store 9551 atomic/atomicrmw 9552 with an equal or 9553 wider sync scope 9554 and memory ordering 9555 stronger than 9556 unordered (this is 9557 termed the 9558 fence-paired-atomic). 9559 - Ensures that all 9560 memory operations 9561 have 9562 completed before 9563 performing the 9564 following 9565 fence-paired-atomic. 9566 9567 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 9568 - system vmcnt(0) & vscnt(0) 9569 9570 - If OpenCL and 9571 address space is 9572 not generic, omit 9573 lgkmcnt(0). 9574 - If OpenCL and 9575 address space is 9576 local, omit 9577 vmcnt(0) and vscnt(0). 9578 - However, since LLVM 9579 currently has no 9580 address space on 9581 the fence need to 9582 conservatively 9583 always generate. If 9584 fence had an 9585 address space then 9586 set to address 9587 space of OpenCL 9588 fence flag, or to 9589 generic if both 9590 local and global 9591 flags are 9592 specified. 9593 - Could be split into 9594 separate s_waitcnt 9595 vmcnt(0), s_waitcnt 9596 vscnt(0) and s_waitcnt 9597 lgkmcnt(0) to allow 9598 them to be 9599 independently moved 9600 according to the 9601 following rules. 9602 - s_waitcnt vmcnt(0) 9603 must happen after 9604 any preceding 9605 global/generic 9606 load/load atomic/ 9607 atomicrmw-with-return-value. 9608 - s_waitcnt vscnt(0) 9609 must happen after 9610 any preceding 9611 global/generic 9612 store/store atomic/ 9613 atomicrmw-no-return-value. 
9614 - s_waitcnt lgkmcnt(0) 9615 must happen after 9616 any preceding 9617 local/generic 9618 load/store/load 9619 atomic/store 9620 atomic/atomicrmw. 9621 - Must happen before 9622 any following store 9623 atomic/atomicrmw 9624 with an equal or 9625 wider sync scope 9626 and memory ordering 9627 stronger than 9628 unordered (this is 9629 termed the 9630 fence-paired-atomic). 9631 - Ensures that all 9632 memory operations 9633 have 9634 completed before 9635 performing the 9636 following 9637 fence-paired-atomic. 9638 9639 **Acquire-Release Atomic** 9640 ------------------------------------------------------------------------------------ 9641 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 9642 - wavefront - local 9643 - generic 9644 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) & 9645 vmcnt(0) & vscnt(0) 9646 9647 - If CU wavefront execution 9648 mode, omit vmcnt(0) and 9649 vscnt(0). 9650 - If OpenCL, omit 9651 lgkmcnt(0). 9652 - Must happen after 9653 any preceding 9654 local/generic 9655 load/store/load 9656 atomic/store 9657 atomic/atomicrmw. 9658 - Could be split into 9659 separate s_waitcnt 9660 vmcnt(0), s_waitcnt 9661 vscnt(0), and s_waitcnt 9662 lgkmcnt(0) to allow 9663 them to be 9664 independently moved 9665 according to the 9666 following rules. 9667 - s_waitcnt vmcnt(0) 9668 must happen after 9669 any preceding 9670 global/generic load/load 9671 atomic/ 9672 atomicrmw-with-return-value. 9673 - s_waitcnt vscnt(0) 9674 must happen after 9675 any preceding 9676 global/generic 9677 store/store 9678 atomic/ 9679 atomicrmw-no-return-value. 9680 - s_waitcnt lgkmcnt(0) 9681 must happen after 9682 any preceding 9683 local/generic 9684 load/store/load 9685 atomic/store 9686 atomic/atomicrmw. 9687 - Must happen before 9688 the following 9689 atomicrmw. 9690 - Ensures that all 9691 memory operations 9692 have 9693 completed before 9694 performing the 9695 atomicrmw that is 9696 being released. 9697 9698 2. 
buffer/global_atomic 9699 3. s_waitcnt vm/vscnt(0) 9700 9701 - If CU wavefront execution 9702 mode, omit. 9703 - Use vmcnt(0) if atomic with 9704 return and vscnt(0) if 9705 atomic with no-return. 9706 - Must happen before 9707 the following 9708 buffer_gl0_inv. 9709 - Ensures any 9710 following global 9711 data read is no 9712 older than the 9713 atomicrmw value 9714 being acquired. 9715 9716 4. buffer_gl0_inv 9717 9718 - If CU wavefront execution 9719 mode, omit. 9720 - Ensures that 9721 following 9722 loads will not see 9723 stale data. 9724 9725 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 9726 9727 - If CU wavefront execution 9728 mode, omit. 9729 - If OpenCL, omit. 9730 - Could be split into 9731 separate s_waitcnt 9732 vmcnt(0) and s_waitcnt 9733 vscnt(0) to allow 9734 them to be 9735 independently moved 9736 according to the 9737 following rules. 9738 - s_waitcnt vmcnt(0) 9739 must happen after 9740 any preceding 9741 global/generic load/load 9742 atomic/ 9743 atomicrmw-with-return-value. 9744 - s_waitcnt vscnt(0) 9745 must happen after 9746 any preceding 9747 global/generic 9748 store/store atomic/ 9749 atomicrmw-no-return-value. 9750 - Must happen before 9751 the following 9752 store. 9753 - Ensures that all 9754 global memory 9755 operations have 9756 completed before 9757 performing the 9758 store that is being 9759 released. 9760 9761 2. ds_atomic 9762 3. s_waitcnt lgkmcnt(0) 9763 9764 - If OpenCL, omit. 9765 - Must happen before 9766 the following 9767 buffer_gl0_inv. 9768 - Ensures any 9769 following global 9770 data read is no 9771 older than the local load 9772 atomic value being 9773 acquired. 9774 9775 4. buffer_gl0_inv 9776 9777 - If CU wavefront execution 9778 mode, omit. 9779 - If OpenCL omit. 9780 - Ensures that 9781 following 9782 loads will not see 9783 stale data. 9784 9785 atomicrmw acq_rel - workgroup - generic 1. 
                                                            s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit vmcnt(0) and
                                                            vscnt(0).
                                                          - If OpenCL, omit lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0), s_waitcnt
                                                            vscnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load/load
                                                            atomic/
                                                            atomicrmw-with-return-value.
                                                          - s_waitcnt vscnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            store/store
                                                            atomic/
                                                            atomicrmw-no-return-value.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            atomicrmw that is
                                                            being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit vmcnt(0) and
                                                            vscnt(0).
                                                          - If OpenCL, omit lgkmcnt(0).
                                                          - Must happen before
                                                            the following
                                                            buffer_gl0_inv.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the load
                                                            atomic value being
                                                            acquired.

                                                         4. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0), s_waitcnt
                                                            vscnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
9873 - s_waitcnt vmcnt(0) 9874 must happen after 9875 any preceding 9876 global/generic 9877 load/load atomic/ 9878 atomicrmw-with-return-value. 9879 - s_waitcnt vscnt(0) 9880 must happen after 9881 any preceding 9882 global/generic 9883 store/store atomic/ 9884 atomicrmw-no-return-value. 9885 - s_waitcnt lgkmcnt(0) 9886 must happen after 9887 any preceding 9888 local/generic 9889 load/store/load 9890 atomic/store 9891 atomic/atomicrmw. 9892 - Must happen before 9893 the following 9894 atomicrmw. 9895 - Ensures that all 9896 memory operations 9897 to global have 9898 completed before 9899 performing the 9900 atomicrmw that is 9901 being released. 9902 9903 2. buffer/global_atomic 9904 3. s_waitcnt vm/vscnt(0) 9905 9906 - Use vmcnt(0) if atomic with 9907 return and vscnt(0) if 9908 atomic with no-return. 9909 - Must happen before 9910 following 9911 buffer_gl*_inv. 9912 - Ensures the 9913 atomicrmw has 9914 completed before 9915 invalidating the 9916 caches. 9917 9918 4. buffer_gl0_inv; 9919 buffer_gl1_inv 9920 9921 - Must happen before 9922 any following 9923 global/generic 9924 load/load 9925 atomic/atomicrmw. 9926 - Ensures that 9927 following loads 9928 will not see stale 9929 global data. 9930 9931 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 9932 - system vmcnt(0) & vscnt(0) 9933 9934 - If OpenCL, omit 9935 lgkmcnt(0). 9936 - Could be split into 9937 separate s_waitcnt 9938 vmcnt(0), s_waitcnt 9939 vscnt(0), and s_waitcnt 9940 lgkmcnt(0) to allow 9941 them to be 9942 independently moved 9943 according to the 9944 following rules. 9945 - s_waitcnt vmcnt(0) 9946 must happen after 9947 any preceding 9948 global/generic 9949 load/load atomic 9950 atomicrmw-with-return-value. 9951 - s_waitcnt vscnt(0) 9952 must happen after 9953 any preceding 9954 global/generic 9955 store/store atomic/ 9956 atomicrmw-no-return-value. 
9957 - s_waitcnt lgkmcnt(0) 9958 must happen after 9959 any preceding 9960 local/generic 9961 load/store/load 9962 atomic/store 9963 atomic/atomicrmw. 9964 - Must happen before 9965 the following 9966 atomicrmw. 9967 - Ensures that all 9968 memory operations 9969 have 9970 completed before 9971 performing the 9972 atomicrmw that is 9973 being released. 9974 9975 2. flat_atomic 9976 3. s_waitcnt vm/vscnt(0) & 9977 lgkmcnt(0) 9978 9979 - If OpenCL, omit 9980 lgkmcnt(0). 9981 - Use vmcnt(0) if atomic with 9982 return and vscnt(0) if 9983 atomic with no-return. 9984 - Must happen before 9985 following 9986 buffer_gl*_inv. 9987 - Ensures the 9988 atomicrmw has 9989 completed before 9990 invalidating the 9991 caches. 9992 9993 4. buffer_gl0_inv; 9994 buffer_gl1_inv 9995 9996 - Must happen before 9997 any following 9998 global/generic 9999 load/load 10000 atomic/atomicrmw. 10001 - Ensures that 10002 following loads 10003 will not see stale 10004 global data. 10005 10006 fence acq_rel - singlethread *none* *none* 10007 - wavefront 10008 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 10009 vmcnt(0) & vscnt(0) 10010 10011 - If CU wavefront execution 10012 mode, omit vmcnt(0) and 10013 vscnt(0). 10014 - If OpenCL and 10015 address space is 10016 not generic, omit 10017 lgkmcnt(0). 10018 - If OpenCL and 10019 address space is 10020 local, omit 10021 vmcnt(0) and vscnt(0). 10022 - However, 10023 since LLVM 10024 currently has no 10025 address space on 10026 the fence need to 10027 conservatively 10028 always generate 10029 (see comment for 10030 previous fence). 10031 - Could be split into 10032 separate s_waitcnt 10033 vmcnt(0), s_waitcnt 10034 vscnt(0) and s_waitcnt 10035 lgkmcnt(0) to allow 10036 them to be 10037 independently moved 10038 according to the 10039 following rules. 10040 - s_waitcnt vmcnt(0) 10041 must happen after 10042 any preceding 10043 global/generic 10044 load/load 10045 atomic/ 10046 atomicrmw-with-return-value. 
                                                          - s_waitcnt vscnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            store/store atomic/
                                                            atomicrmw-no-return-value.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store atomic/
                                                            atomicrmw.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing any
                                                            following global
                                                            memory operations.
                                                          - Ensures that the
                                                            preceding
                                                            local/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            acquire-fence-paired-atomic)
                                                            has completed
                                                            before following
                                                            global memory
                                                            operations. This
                                                            satisfies the
                                                            requirements of
                                                            acquire.
                                                          - Ensures that all
                                                            previous memory
                                                            operations have
                                                            completed before a
                                                            following
                                                            local/generic store
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            release-fence-paired-atomic).
                                                            This satisfies the
                                                            requirements of
                                                            release.
                                                          - Must happen before
                                                            the following
                                                            buffer_gl0_inv.
                                                          - Ensures that the
                                                            acquire-fence-paired
                                                            atomic has completed
                                                            before invalidating
                                                            the
                                                            cache. Therefore
                                                            any following
                                                            locations read must
                                                            be no older than
                                                            the value read by
                                                            the
                                                            acquire-fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     fence        acq_rel      - agent        *none*     1.
s_waitcnt lgkmcnt(0) & 10134 - system vmcnt(0) & vscnt(0) 10135 10136 - If OpenCL and 10137 address space is 10138 not generic, omit 10139 lgkmcnt(0). 10140 - If OpenCL and 10141 address space is 10142 local, omit 10143 vmcnt(0) and vscnt(0). 10144 - However, since LLVM 10145 currently has no 10146 address space on 10147 the fence need to 10148 conservatively 10149 always generate 10150 (see comment for 10151 previous fence). 10152 - Could be split into 10153 separate s_waitcnt 10154 vmcnt(0), s_waitcnt 10155 vscnt(0) and s_waitcnt 10156 lgkmcnt(0) to allow 10157 them to be 10158 independently moved 10159 according to the 10160 following rules. 10161 - s_waitcnt vmcnt(0) 10162 must happen after 10163 any preceding 10164 global/generic 10165 load/load 10166 atomic/ 10167 atomicrmw-with-return-value. 10168 - s_waitcnt vscnt(0) 10169 must happen after 10170 any preceding 10171 global/generic 10172 store/store atomic/ 10173 atomicrmw-no-return-value. 10174 - s_waitcnt lgkmcnt(0) 10175 must happen after 10176 any preceding 10177 local/generic 10178 load/store/load 10179 atomic/store 10180 atomic/atomicrmw. 10181 - Must happen before 10182 the following 10183 buffer_gl*_inv. 10184 - Ensures that the 10185 preceding 10186 global/local/generic 10187 load 10188 atomic/atomicrmw 10189 with an equal or 10190 wider sync scope 10191 and memory ordering 10192 stronger than 10193 unordered (this is 10194 termed the 10195 acquire-fence-paired-atomic) 10196 has completed 10197 before invalidating 10198 the caches. This 10199 satisfies the 10200 requirements of 10201 acquire. 10202 - Ensures that all 10203 previous memory 10204 operations have 10205 completed before a 10206 following 10207 global/local/generic 10208 store 10209 atomic/atomicrmw 10210 with an equal or 10211 wider sync scope 10212 and memory ordering 10213 stronger than 10214 unordered (this is 10215 termed the 10216 release-fence-paired-atomic). 10217 This satisfies the 10218 requirements of 10219 release. 
10220 10221 2. buffer_gl0_inv; 10222 buffer_gl1_inv 10223 10224 - Must happen before 10225 any following 10226 global/generic 10227 load/load 10228 atomic/store/store 10229 atomic/atomicrmw. 10230 - Ensures that 10231 following loads 10232 will not see stale 10233 global data. This 10234 satisfies the 10235 requirements of 10236 acquire. 10237 10238 **Sequential Consistent Atomic** 10239 ------------------------------------------------------------------------------------ 10240 load atomic seq_cst - singlethread - global *Same as corresponding 10241 - wavefront - local load atomic acquire, 10242 - generic except must generate 10243 all instructions even 10244 for OpenCL.* 10245 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) & 10246 - generic vmcnt(0) & vscnt(0) 10247 10248 - If CU wavefront execution 10249 mode, omit vmcnt(0) and 10250 vscnt(0). 10251 - Could be split into 10252 separate s_waitcnt 10253 vmcnt(0), s_waitcnt 10254 vscnt(0), and s_waitcnt 10255 lgkmcnt(0) to allow 10256 them to be 10257 independently moved 10258 according to the 10259 following rules. 10260 - s_waitcnt lgkmcnt(0) must 10261 happen after 10262 preceding 10263 local/generic load 10264 atomic/store 10265 atomic/atomicrmw 10266 with memory 10267 ordering of seq_cst 10268 and with equal or 10269 wider sync scope. 10270 (Note that seq_cst 10271 fences have their 10272 own s_waitcnt 10273 lgkmcnt(0) and so do 10274 not need to be 10275 considered.) 10276 - s_waitcnt vmcnt(0) 10277 must happen after 10278 preceding 10279 global/generic load 10280 atomic/ 10281 atomicrmw-with-return-value 10282 with memory 10283 ordering of seq_cst 10284 and with equal or 10285 wider sync scope. 10286 (Note that seq_cst 10287 fences have their 10288 own s_waitcnt 10289 vmcnt(0) and so do 10290 not need to be 10291 considered.) 
10292 - s_waitcnt vscnt(0) 10293 Must happen after 10294 preceding 10295 global/generic store 10296 atomic/ 10297 atomicrmw-no-return-value 10298 with memory 10299 ordering of seq_cst 10300 and with equal or 10301 wider sync scope. 10302 (Note that seq_cst 10303 fences have their 10304 own s_waitcnt 10305 vscnt(0) and so do 10306 not need to be 10307 considered.) 10308 - Ensures any 10309 preceding 10310 sequential 10311 consistent global/local 10312 memory instructions 10313 have completed 10314 before executing 10315 this sequentially 10316 consistent 10317 instruction. This 10318 prevents reordering 10319 a seq_cst store 10320 followed by a 10321 seq_cst load. (Note 10322 that seq_cst is 10323 stronger than 10324 acquire/release as 10325 the reordering of 10326 load acquire 10327 followed by a store 10328 release is 10329 prevented by the 10330 s_waitcnt of 10331 the release, but 10332 there is nothing 10333 preventing a store 10334 release followed by 10335 load acquire from 10336 completing out of 10337 order. The s_waitcnt 10338 could be placed after 10339 seq_store or before 10340 the seq_load. We 10341 choose the load to 10342 make the s_waitcnt be 10343 as late as possible 10344 so that the store 10345 may have already 10346 completed.) 10347 10348 2. *Following 10349 instructions same as 10350 corresponding load 10351 atomic acquire, 10352 except must generate 10353 all instructions even 10354 for OpenCL.* 10355 load atomic seq_cst - workgroup - local 10356 10357 1. s_waitcnt vmcnt(0) & vscnt(0) 10358 10359 - If CU wavefront execution 10360 mode, omit. 10361 - Could be split into 10362 separate s_waitcnt 10363 vmcnt(0) and s_waitcnt 10364 vscnt(0) to allow 10365 them to be 10366 independently moved 10367 according to the 10368 following rules. 
10369 - s_waitcnt vmcnt(0) 10370 Must happen after 10371 preceding 10372 global/generic load 10373 atomic/ 10374 atomicrmw-with-return-value 10375 with memory 10376 ordering of seq_cst 10377 and with equal or 10378 wider sync scope. 10379 (Note that seq_cst 10380 fences have their 10381 own s_waitcnt 10382 vmcnt(0) and so do 10383 not need to be 10384 considered.) 10385 - s_waitcnt vscnt(0) 10386 Must happen after 10387 preceding 10388 global/generic store 10389 atomic/ 10390 atomicrmw-no-return-value 10391 with memory 10392 ordering of seq_cst 10393 and with equal or 10394 wider sync scope. 10395 (Note that seq_cst 10396 fences have their 10397 own s_waitcnt 10398 vscnt(0) and so do 10399 not need to be 10400 considered.) 10401 - Ensures any 10402 preceding 10403 sequential 10404 consistent global 10405 memory instructions 10406 have completed 10407 before executing 10408 this sequentially 10409 consistent 10410 instruction. This 10411 prevents reordering 10412 a seq_cst store 10413 followed by a 10414 seq_cst load. (Note 10415 that seq_cst is 10416 stronger than 10417 acquire/release as 10418 the reordering of 10419 load acquire 10420 followed by a store 10421 release is 10422 prevented by the 10423 s_waitcnt of 10424 the release, but 10425 there is nothing 10426 preventing a store 10427 release followed by 10428 load acquire from 10429 completing out of 10430 order. The s_waitcnt 10431 could be placed after 10432 seq_store or before 10433 the seq_load. We 10434 choose the load to 10435 make the s_waitcnt be 10436 as late as possible 10437 so that the store 10438 may have already 10439 completed.) 10440 10441 2. *Following 10442 instructions same as 10443 corresponding load 10444 atomic acquire, 10445 except must generate 10446 all instructions even 10447 for OpenCL.* 10448 10449 load atomic seq_cst - agent - global 1. 
s_waitcnt lgkmcnt(0) & 10450 - system - generic vmcnt(0) & vscnt(0) 10451 10452 - Could be split into 10453 separate s_waitcnt 10454 vmcnt(0), s_waitcnt 10455 vscnt(0) and s_waitcnt 10456 lgkmcnt(0) to allow 10457 them to be 10458 independently moved 10459 according to the 10460 following rules. 10461 - s_waitcnt lgkmcnt(0) 10462 must happen after 10463 preceding 10464 local load 10465 atomic/store 10466 atomic/atomicrmw 10467 with memory 10468 ordering of seq_cst 10469 and with equal or 10470 wider sync scope. 10471 (Note that seq_cst 10472 fences have their 10473 own s_waitcnt 10474 lgkmcnt(0) and so do 10475 not need to be 10476 considered.) 10477 - s_waitcnt vmcnt(0) 10478 must happen after 10479 preceding 10480 global/generic load 10481 atomic/ 10482 atomicrmw-with-return-value 10483 with memory 10484 ordering of seq_cst 10485 and with equal or 10486 wider sync scope. 10487 (Note that seq_cst 10488 fences have their 10489 own s_waitcnt 10490 vmcnt(0) and so do 10491 not need to be 10492 considered.) 10493 - s_waitcnt vscnt(0) 10494 Must happen after 10495 preceding 10496 global/generic store 10497 atomic/ 10498 atomicrmw-no-return-value 10499 with memory 10500 ordering of seq_cst 10501 and with equal or 10502 wider sync scope. 10503 (Note that seq_cst 10504 fences have their 10505 own s_waitcnt 10506 vscnt(0) and so do 10507 not need to be 10508 considered.) 10509 - Ensures any 10510 preceding 10511 sequential 10512 consistent global 10513 memory instructions 10514 have completed 10515 before executing 10516 this sequentially 10517 consistent 10518 instruction. This 10519 prevents reordering 10520 a seq_cst store 10521 followed by a 10522 seq_cst load. 
                                                           (Note that seq_cst is
                                                           stronger than
                                                           acquire/release as
                                                           the reordering of
                                                           load acquire
                                                           followed by a store
                                                           release is
                                                           prevented by the
                                                           s_waitcnt of
                                                           the release, but
                                                           there is nothing
                                                           preventing a store
                                                           release followed by
                                                           load acquire from
                                                           completing out of
                                                           order. The s_waitcnt
                                                           could be placed after
                                                           seq_store or before
                                                           the seq_load. We
                                                           choose the load to
                                                           make the s_waitcnt be
                                                           as late as possible
                                                           so that the store
                                                           may have already
                                                           completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
supports the ``s_trap`` instruction. For usage see:

- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                         ``queue_ptr``   intrinsic (not implemented).
                                         ``VGPR0``:
                                         ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, then
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                         breakpoints. Causes wave to be halted
                                                         with the PC at the trap instruction.
                                                         The debugger is responsible for
                                                         resuming the wave, including the
                                                         instruction that the breakpoint
                                                         overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, then
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
                                                                            breakpoints. Causes wave to be halted
                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible for
                                                                            resuming the wave, including the
                                                                            instruction that the breakpoint
                                                                            overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
                                         ``queue_ptr``                      the trap instruction. The associated
                                                                            queue is signalled to put it into the
                                                                            error state. When the queue is put in
                                                                            the error state, the waves executing
                                                                            dispatches on the queue will be
                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled, then
                                                                              behaves as a no-operation. The trap
                                                                              handler is entered and immediately
                                                                              returns to continue execution of the
                                                                              wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is a work in
  progress that will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        by-value struct?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:

1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-GFX8: The M0 register is set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 hold the return address (RA), the code address that the function
   must return to when it completes. The value is undefined if the function is
   *no return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

   | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4-byte aligned for the ``r600``
   architecture and 16-byte aligned for the ``amdgcn`` architecture.

   .. note::

     The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
     OpenCL language, which has its largest base type defined as 16 bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack. Other stack passed arguments are positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack passed argument
   for stack allocated local variables and register spill slots. If necessary,
   the function may align these to greater alignment than 16 bytes. After these
   the function may dynamically allocate space for such things as runtime sized
   ``alloca`` local allocations.

   If the function calls another function, it will place any stack allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is
    available to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-GFX8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
   * VGPR56-63
   * VGPR72-79
   * VGPR88-95
   * VGPR104-111
   * VGPR120-127
   * VGPR136-143
   * VGPR152-159
   * VGPR168-175
   * VGPR184-191
   * VGPR200-207
   * VGPR216-223
   * VGPR232-239
   * VGPR248-255

   .. note::

     Except for the argument registers, the clobbered and the preserved VGPRs
     are intermixed at regular intervals in order to keep a similar ratio
     independent of the number of allocated VGPRs.

   * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
   * Lanes of all VGPRs that are inactive at the call site.

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if
the called function requires the struct to be in memory, for example because
its address is taken, then the function body is responsible for allocating a
stack location and copying the field arguments into it. Clang terms this
*direct struct*.

Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location
to make a copy of the struct value and passes the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack
location to hold the result value and passes the address as the last input
argument (before the implicit input arguments). In this case there are no
result arguments. The called function is responsible for performing the
dereference when storing the result value. Clang terms this *structured return
(sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in
   registers, and the non-decomposed struct passed as a stack argument? Or
   something else? Is the memory location in the caller stack frame, or a
   stack memory argument so that no address is passed as the caller can
   directly write to the argument stack location? But then the stack location
   is still live after return. If an argument stack location, is it the first
   stack argument or the last one?

Lambda argument types are treated as struct types with an implementation
defined set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the
decomposed struct type arguments) are passed in VGPRs unless marked ``inreg``,
in which case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function.
The used implicit arguments are appended to the function arguments after the
source language arguments in the following order:

.. TODO::

   Are recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only fields actually used by the function are set. The
   other bits are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to the Kernarg Segment Ptr to get
   the global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:

.. note::

   There are likely some errors and omissions in the following description
   that need correction.

   .. TODO::

      Check the Clang source code to decipher how function arguments and
      return results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
   unswizzled scratch address. It is only needed if runtime sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires runtime stack alignment.

3. Allocating SGPR arguments on the stack is not supported.

4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

     CFI will be generated that defines the CFA as the unswizzled address
     relative to the wave scratch base in the unswizzled private address space
     of the lowest address stack allocated local variable.

     ``DW_AT_frame_base`` will be defined as the swizzled address in the
     swizzled private address space by dividing the CFA by the wavefront size
     (since the CFA is always at least dword aligned, which matches the
     scratch swizzle element size).

     If no dynamic stack alignment was performed, the stack allocated
     arguments are accessed as negative offsets relative to
     ``DW_AT_frame_base``, and the local variables and register spill slots
     are accessed as positive offsets relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill
   if necessary. These are copied back to physical registers at call sites.
   The net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

     The CFI will reflect the changed calculation needed to compute the CFA
     from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for
   an emergency spill slot. Buffer instructions are used for stack accesses
   and not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to the required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check the register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - Struct with 1 member unpeeling is not checking the size of the member.
   - ``sret`` is after the ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define the CFA as the address of the locals or of the
     arguments? The difference is apparent once dynamic alignment has been
     implemented.
   - If the ``SCRATCH`` instruction could allow negative offsets, then FP
     could be made the highest address of the stack frame and negative
     offsets used for locals. This would allow SP to be the same as FP and
     could support signal-handler-like usage as there would then be a real SP
     for the top of the stack.
   - How is ``sret`` passed on the stack? In the argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).
.. _amdgpu-amdpal-code-object-metadata-section:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

.. note::

  The metadata is currently in development and is subject to major
  changes. Only the current version is supported. *When this document
  was generated the version was 2.6.*

Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-v4`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the keys
defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by "*vendor-name*." where ``vendor-name``
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.

  .. table:: AMDPAL Code Object Metadata Map
     :name: amdgpu-amdpal-code-object-metadata-map-table

     =================== ============== ========= ======================================================================
     String Key          Value Type     Required? Description
     =================== ============== ========= ======================================================================
     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                  definition of the keys included in that map.
     =================== ============== ========= ======================================================================

..

  .. table:: AMDPAL Code Object Pipeline Metadata Map
     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table

     ====================================== ============== ========= ===================================================
     String Key                             Value Type     Required? Description
     ====================================== ============== ========= ===================================================
     ".name"                                string                   Source name of the pipeline.
     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:

                                                                     - "VsPs"
                                                                     - "Gs"
                                                                     - "Cs"
                                                                     - "Ngg"
                                                                     - "Tess"
                                                                     - "GsTess"
                                                                     - "NggTess"

     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. The
                                            2 integers               lower 64 bits are the "stable" portion of the
                                                                     hash, used for e.g. shader replacement lookup. The
                                                                     upper 64 bits are the "unique" portion of the
                                                                     hash, used for e.g. pipeline cache lookup. The
                                                                     value is implementation defined, and cannot be
                                                                     relied on between different builds of the
                                                                     compiler.
     ".shaders"                             map                      Per-API shader metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".hardware_stages"                     map                      Per-hardware stage metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".shader_functions"                    map                      Per-shader function metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".registers"                           map            Required  Hardware register configuration. See
                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".user_data_limit"                     integer                  Number of user data entries accessed by this
                                                                     pipeline.
     ".spill_threshold"                     integer                  The user data spill threshold. 0xFFFF for
                                                                     NoUserDataSpilling.
     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
                                                                     viewport array index feature. Pipelines which use
                                                                     this feature can render into all 16 viewports,
                                                                     whereas pipelines which do not use it are
                                                                     restricted to viewport #0.
     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
                                                                     handling data-passing between the ES and GS
                                                                     shader stages. This can be zero if the data is
                                                                     passed using off-chip buffers. This value should
                                                                     be used to program all user-SGPRs which have been
                                                                     marked with "UserDataMapping::EsGsLdsSize"
                                                                     (typically only the GS and VS HW stages will ever
                                                                     have a user-SGPR so marked).
     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
                                                                     (maximum number of threads in a subgroup).
     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
     ".mesh_scratch_memory_size"            integer                  Maximum mesh shader scratch memory used.
     ".api"                                 string                   Name of the client graphics API.
     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
                                                                     be defined by the driver using the compiler if
                                                                     they want to be able to correlate API-specific
                                                                     information used during creation at a later time.
     ====================================== ============== ========= ===================================================

..

  .. table:: AMDPAL Code Object Shader Map
     :name: amdgpu-amdpal-code-object-shader-map-table

     +-------------+--------------+-------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                        |
     +=============+==============+===================================================================+
     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
     |- ".vertex"  |              |for the definition of the keys included in that map.               |
     |- ".hull"    |              |                                                                   |
     |- ".domain"  |              |                                                                   |
     |- ".geometry"|              |                                                                   |
     |- ".pixel"   |              |                                                                   |
     +-------------+--------------+-------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object API Shader Metadata Map
     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table

     ==================== ============== ========= =====================================================================
     String Key           Value Type     Required? Description
     ==================== ============== ========= =====================================================================
     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
                          2 integers               is implementation defined, and cannot be relied on between
                                                   different builds of the compiler.
     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
                          string                   include:

                                                   - ".ls"
                                                   - ".hs"
                                                   - ".es"
                                                   - ".gs"
                                                   - ".vs"
                                                   - ".ps"
                                                   - ".cs"

     ==================== ============== ========= =====================================================================

..

  .. table:: AMDPAL Code Object Hardware Stage Map
     :name: amdgpu-amdpal-code-object-hardware-stage-map-table

     +-------------+--------------+-----------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                            |
     +=============+==============+=======================================================================+
     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
     |- ".hs"      |              |for the definition of the keys included in that map.                   |
     |- ".es"      |              |                                                                       |
     |- ".gs"      |              |                                                                       |
     |- ".vs"      |              |                                                                       |
     |- ".ps"      |              |                                                                       |
     |- ".cs"      |              |                                                                       |
     +-------------+--------------+-----------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table

     ========================== ============== ========= ===============================================================
     String Key                 Value Type     Required? Description
     ========================== ============== ========= ===============================================================
     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
     ".lds_size"                integer                  Local Data Share size in bytes.
     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
     ".vgpr_count"              integer                  Number of VGPRs used.
     ".sgpr_count"              integer                  Number of SGPRs used.
     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
                                                         directive to instruct the compiler to limit the VGPR usage to
                                                         be less than or equal to the specified value (only set if
                                                         different from HW default).
     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
                                                         default).
     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
                                3 integers
     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
     ".writes_depth"            boolean                  The shader writes out a depth value.
     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
                                                         memory or GDS.
     ".uses_prim_id"            boolean                  The shader uses PrimID.
     ========================== ============== ========= ===============================================================

..

  .. table:: AMDPAL Code Object Shader Function Map
     :name: amdgpu-amdpal-code-object-shader-function-map-table

     =============== ============== ====================================================================
     String Key      Value Type     Description
     =============== ============== ====================================================================
     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
                                    entry address. The value is the function's metadata. See
                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
     =============== ============== ====================================================================

..

  .. table:: AMDPAL Code Object Shader Function Metadata Map
     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table

     ============================= ============== =================================================================
     String Key                    Value Type     Description
     ============================= ============== =================================================================
     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The
                                   2 integers     value is implementation defined, and cannot be relied on
                                                  between different builds of the compiler.
     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
     ".lds_size"                   integer        Size in bytes of LDS memory.
     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
     ".shader_subtype"             string         Shader subtype/kind. Values include:

                                                  - "Unknown"

     ============================= ============== =================================================================

..

  .. table:: AMDPAL Code Object Register Map
     :name: amdgpu-amdpal-code-object-register-map-table

     ========================== ============== ====================================================================
     32-bit Integer Key         Value Type     Description
     ========================== ============== ====================================================================
     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
                                               a GRBM register (i.e., driver accessible GPU register number, not
                                               shader GPR register number). The driver is required to program each
                                               specified register to the corresponding specified value when
                                               executing this pipeline. Typically, the ``reg offsets`` are the
                                               ``uint16_t`` offsets to each register as defined by the hardware
                                               chip headers. The register is set to the provided value. However, a
                                               ``reg offset`` that specifies a user data register (e.g.,
                                               COMPUTE_USER_DATA_0) needs special treatment. See the
                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
                                               information.
     ========================== ============== ====================================================================

.. _amdgpu-amdpal-code-object-user-data-section:

User Data
+++++++++

Each hardware stage has a set of 32-bit physical SPI *user data registers*
(either 16 or 32 based on graphics IP and the stage) which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware
shader.

PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline that a client can use to pass arguments from a
command buffer to one or more shaders in that pipeline. The ELF code object
must specify a mapping from virtualized *user data entries* to physical
*user data registers*, and PAL is responsible for implementing that
mapping, including spilling overflow *user data entries* to memory if
needed.

Since the *user data registers* are GRBM-accessible SPI registers, this
mapping is actually embedded in the ``.registers`` metadata entry. For
most registers, the value in that map is a literal 32-bit value that
should be written to the register by the driver.
However, when the
register is a *user data register* (any USER_DATA register, e.g.,
SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
the driver to write either a *user data entry* value or one of several
driver-internal values to the register. This encoding is described in
the following table:

.. note::

  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0
  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
  always be programmed to the address of the GlobalTable, and *user data
  register* 1 must always be programmed to the address of the PerShaderTable.

..

  .. table:: AMDPAL User Data Mapping
     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table

     ========== ================= ===============================================================================
     Value      Name              Description
     ========== ================= ===============================================================================
     0..127     *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*.
     0x10000000 GlobalTable       32-bit pointer to GPU memory containing the global internal table (should
                                  always point to *user data register* 0).
     0x10000001 PerShaderTable    32-bit pointer to GPU memory containing the per-shader internal table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
                                  for more detail (should always point to *user data register* 1).
     0x10000002 SpillTable        32-bit pointer to GPU memory containing the user data spill table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
                                  more detail.
     0x10000003 BaseVertex        Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
                                  reference the draw index in the vertex shader. Only supported by the first
                                  stage in a graphics pipeline.
     0x10000004 BaseInstance      Instance offset (32-bit unsigned integer). Only supported by the first stage
                                  in a graphics pipeline.
     0x10000005 DrawIndex         Draw index (32-bit unsigned integer). Only supported by the first stage in a
                                  graphics pipeline.
     0x10000006 Workgroup         Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
                                  a buffer containing the grid dimensions for a Compute dispatch operation. The
                                  high half of the address is stored in the next sequential user-SGPR. Only
                                  supported by compute pipelines.
     0x1000000A EsGsLdsSize       Indicates that PAL will program this user-SGPR to contain the amount of LDS
                                  space used for the ES/GS pseudo-ring-buffer for passing data between shader
                                  stages.
     0x1000000B ViewId            View id (32-bit unsigned integer) identifies a view of graphic
                                  pipeline instancing.
     0x1000000C StreamOutTable    32-bit pointer to GPU memory containing the stream out target SRD table. This
                                  can only appear for one shader stage per pipeline.
     0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data
                                  buffer.
     0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
                                  only appear for one shader stage per pipeline.
     0x10000010 UavExportTable    32-bit pointer to GPU memory containing the UAV export SRD table. This can
                                  only appear for one shader stage per pipeline (PS). These replace color
                                  targets and are completely separate from any UAVs used by the shader. This is
                                  optional, and only used by the PS when UAV exports are used to replace
                                  color-target exports to optimize specific shaders.
     0x10000011 NggCullingData    64-bit pointer to GPU memory containing the hardware register data needed by
                                  some NGG pipelines to perform culling. This value contains the address of
                                  the first of two consecutive registers which provide the full GPU address.
     0x10000015 FetchShaderPtr    64-bit pointer to GPU memory containing the fetch shader subroutine.
     ========== ================= ===============================================================================

.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:

Per-Shader Table
################

Low 32 bits of the GPU address for an optional buffer in the ``.data``
section of the ELF. The high 32 bits of the address match the high 32 bits
of the shader's program counter.

The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.

Each shader's table in the ``.data`` section is referenced by the symbol
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
hardware shader stage the data is for. E.g.,
``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.

.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:

Spill Table
###########

It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory and uses
one user data register to point to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address.
The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included
in this description.

  =================================== =======================================
  Core ISA                            ISA Extensions
  =================================== =======================================
  :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
  :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                      :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                      :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                      :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`

  :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                      :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
  =================================== =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.

Operands
~~~~~~~~

A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the
ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the
ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the
ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the
ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the
ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP,
VOP_SDWA), the assembler will automatically use the optimal encoding based on
the operands. To force a specific encoding, one can add a suffix to the opcode
of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.

.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:

Code Object V2 Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.option.machine_version_major
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_minor
+++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for.
For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_stepping
++++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.kernel.vgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus
one.

.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.

.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, one can specify them with assembler directives.
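To illustrate how these directives fit together, a V2 assembly file might
begin as follows. This is a minimal, hypothetical sketch: the kernel name,
the ISA version numbers, and the ``amd_kernel_code_t`` keys shown are
placeholders whose correct values depend on the target and kernel being
assembled.

.. code-block:: nasm

  .hsa_code_object_version 2,1
  .hsa_code_object_isa 8,0,3,"AMD","AMDGPU"

  .text
  .amdgpu_hsa_kernel minimal_kernel
  minimal_kernel:
    ; key/value pairs omitted here use their documented defaults
    .amd_kernel_code_t
      is_ptr64 = 1
      enable_sgpr_kernarg_segment_ptr = 1
    .end_amd_kernel_code_t
    s_endpgm

The individual directives used above are described in the sections that
follow.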

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified, a default value will be used.
The default value for all keys is 0, with the following exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards, it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that the wavefront size is specified as a power of two, so
  a value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and test/CodeGen/AMDGPU/hsa.s.

.. _amdgpu-amdhsa-assembler-example-v2:

Code Object V2 Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
  :number-lines:

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl  hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

  .amd_kernel_code_t
     enable_sgpr_kernarg_segment_ptr = 1
     is_ptr64 = 1
     compute_pgm_rsrc1_vgprs = 0
     compute_pgm_rsrc1_sgprs = 0
     compute_pgm_rsrc2_user_sgpr = 2
     compute_pgm_rsrc1_wgp_mode = 0
     compute_pgm_rsrc1_mem_ordered = 0
     compute_pgm_rsrc1_fwd_progress = 1
  .end_amd_kernel_code_t

  s_load_dwordx2 s[0:1], s[0:1] 0x0
  v_mov_b32 v0, 3.14159
  s_waitcnt lgkmcnt(0)
  v_mov_b32 v1, s0
  v_mov_b32 v2, s1
  flat_store_dword v[1:2], v0
  s_endpgm
  .Lfunc_end0:
  .size   hello_world, .Lfunc_end0-hello_world

.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:

Code Object V3 to V4 Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_minor
++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction, then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the `.amdhsa_next_free_vgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction, then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-directives-v3-v4:

Code Object V3 to V4 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.
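
For illustration, a ``<target-triple>-<target-id>`` string as accepted by the
``.amdgcn_target`` directive below can be decomposed as follows. This is a
sketch only, assuming the colon-separated ``<processor>:<feature>±`` target-ID
syntax used by code object V4 (the V2 to V3 syntax differs); the helper name
is hypothetical and real validation is performed by the assembler:

```python
def parse_amdgcn_target(value: str) -> dict:
    """Split "<arch>-<vendor>-<os>-<env>-<target-id>" into its triple
    components and target ID, e.g.
    "amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-"."""
    # Only the first four '-' separate triple components; the rest
    # belongs to the target ID.
    arch, vendor, os_name, env, target_id = value.split("-", 4)
    processor, _, features = target_id.partition(":")
    return {
        "triple": f"{arch}-{vendor}-{os_name}-{env}",
        "processor": processor,
        "features": features.split(":") if features else [],
    }

info = parse_amdgcn_target("amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-")
print(info["processor"])  # gfx90a
print(info["features"])   # ['sramecc+', 'xnack-']
```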

.. _amdgpu-assembler-directive-amdgcn-target:

.amdgcn_target <target-triple> "-" <target-id>
++++++++++++++++++++++++++++++++++++++++++++++

Optional directive which declares the ``<target-triple>-<target-id>`` supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as ``-triple``, ``-mcpu``, and
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.

.. note::

  The target ID syntax used for code object V2 to V3 for this directive differs
  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.
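
The values for ``.amdhsa_next_free_vgpr`` and ``.amdhsa_next_free_sgpr`` are
typically taken from the ``.amdgcn.next_free_*`` symbols described above. The
update rule those symbols follow can be sketched in Python (illustrative only,
with a simplified operand syntax; the assembler applies this rule while
parsing each instruction):

```python
import re

# Matches explicit VGPR references: a single register like v7, or a
# range like v[1:2] (simplified operand syntax for this sketch).
VGPR_RE = re.compile(r"\bv(?:(\d+)\b|\[(\d+):(\d+)\])")

def update_next_free_vgpr(counter: int, instruction: str) -> int:
    """Documented rule: if the counter is less than or equal to the
    maximum VGPR number explicitly referenced by the instruction,
    raise it to that maximum plus one."""
    for single, lo, hi in VGPR_RE.findall(instruction):
        top = int(hi) if hi else int(single)
        if counter <= top:
            counter = top + 1
    return counter

counter = 0
for inst in ("v_mov_b32 v1, s0",
             "v_mov_b32 v2, s1",
             "flat_store_dword v[1:2], v0"):
    counter = update_next_free_vgpr(counter, inst)
print(counter)  # 3
```

Applied to the three vector instructions of the minimal kernel example later
in this section, the counter ends at 3, matching its ``.vgpr_count: 3``
metadata entry.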

.. table:: AMDHSA Kernel Assembler Directives
   :name: amdhsa-kernel-directives-table

   ======================================================== =================== ============ ===================
   Directive                                                Default             Supported On Description
   ======================================================== =================== ============ ===================
   ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX10   Controls KERNARG_SIZE in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
                                                            Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                            Specific
                                                            (wavefrontsize64)
   ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
                                                                                             Possible values are defined in
                                                                                             :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
   ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
                                                                                             Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
                                                                                             Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_accum_offset``                                 Required            GFX90A       Offset of the first AccVGPR in the unified register
                                                                                             file. Used to calculate ACCUM_OFFSET in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
   ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
                                                                                             Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
                                                                                             scratch memory. Used to calculate
                                                                                             GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
                                                            Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                            Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                            (xnack)
   ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                             Possible values are defined in
                                                                                             :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
   ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                             Possible values are defined in
                                                                                             :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
   ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                             Possible values are defined in
                                                                                             :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
   ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                             Possible values are defined in
                                                                                             :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
   ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_tg_split``                                     Target              GFX90A       Controls TG_SPLIT in
                                                            Feature                          :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                            Specific
                                                            (tgsplit)
   ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
                                                            Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                            Specific
                                                            (cumode)
   ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
   ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
   ======================================================== =================== ============ ===================

.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3-v4:

Code Object V3 to V4 Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  .amdgpu_metadata
  ---
  amdhsa.version:
    - 1
    - 0
  amdhsa.kernels:
    - .name: hello_world
      .symbol: hello_world.kd
      .kernarg_segment_size: 48
      .group_segment_fixed_size: 0
      .private_segment_fixed_size: 0
      .kernarg_segment_align: 4
      .wavefront_size: 64
      .sgpr_count: 2
      .vgpr_count: 3
      .max_flat_workgroup_size: 256
      .args:
        - .size: 8
          .offset: 0
          .value_kind: global_buffer
          .address_space: global
          .actual_access: write_only
  //...
  .end_amdgpu_metadata

This kernel is equivalent to the following HIP program:

.. code::
  :number-lines:

  __global__ void hello_world(float *p) {
    *p = 3.14159f;
  }

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive.
For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:

.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  // gpr tracking symbols are implicitly set to zero

  .text
  .globl kern0
  .p2align 8
  .type kern0,@function
  kern0:
    // ...
    s_endpgm
  .Lkern0_end:
    .size kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@hi+4
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

These symbols cannot identify connected components in order to automatically
track the usage for each kernel.
However, in some cases careful organization of
the kernels and functions in the source file means there is minimal additional
effort required to accurately calculate GPR usage.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__