=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphic shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphic shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:

* ``amdhsa`` is not supported in ``r600`` architecture (see :ref:`amdgpu-architecture-table`).


  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                        Add product
                                                                        IDs                              names.

     **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.
     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::
                                                                                                         Add product
                                                                                                         names.

     ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit           ``-m[no-]tgsplit``         Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4 or above, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 or above without
                                                  specifying XNACK replay. Executing code that was
                                                  generated with XNACK replay enabled, or generated
                                                  for code object V4 or above without specifying XNACK replay,
                                                  on a device that does not have XNACK replay
                                                  enabled will execute correctly but may be less
                                                  performant than code generated for XNACK replay
                                                  disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::= <processor> ( "+" <target-feature> )*

Where a target feature is omitted if *Off* and present if *On* or *Any*.

.. note::

  The code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                             64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.

  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures), that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
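
  The aperture-based routing of flat addresses can be sketched in a few lines
  of Python. This is only an illustration: the aperture base values below are
  hypothetical placeholders (the real values are provided by the hardware and
  runtime), the function names are invented for this sketch, and the 2^32
  aperture size and alignment are the values this guide gives for 64-bit
  address mode. NULL pointer handling is deliberately left out.

```python
# Hypothetical aperture bases, for illustration only. Real values come from
# the hardware/runtime, not from this sketch.
APERTURE_SIZE = 1 << 32          # aperture size in 64-bit address mode
SHARED_BASE = 0x1000 << 32       # assumed local (LDS) aperture base
PRIVATE_BASE = 0x2000 << 32      # assumed private (scratch) aperture base


def classify_flat_address(flat_addr: int) -> str:
    """Return which memory a flat address would access."""
    if SHARED_BASE <= flat_addr < SHARED_BASE + APERTURE_SIZE:
        return "local"
    if PRIVATE_BASE <= flat_addr < PRIVATE_BASE + APERTURE_SIZE:
        return "private"
    # Anything outside both apertures is treated as a global address.
    return "global"


def flat_to_segment(flat_addr: int) -> int:
    """Keep the low 32 bits: with a 2^32-aligned base they are the segment address."""
    return flat_addr & (APERTURE_SIZE - 1)


def segment_to_flat(segment_addr: int, aperture_base: int) -> int:
    """Combine a 32-bit segment address with its 2^32-aligned aperture base."""
    return aperture_base | (segment_addr & (APERTURE_SIZE - 1))
```

  Because the bases are aligned to the aperture size, both conversions reduce
  to masking or OR-ing the low 32 bits, with no addition or subtraction needed.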

  To convert between a private or group address space address (termed a segment
  address) and a flat address the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
  GFX9-GFX10 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32 which makes it easier to convert from flat to segment or
  segment to flat.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. Because the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only the device global memory is
  visible and managed on the host side. Volatile data in the vector and scalar
  L1 caches is invalidated before each kernel dispatch execution to allow
  constant memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It offers higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It offers higher performance than global memory. The data
  store (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different to the memory accessed by another lane
  of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving. The mapping
  used from private address to backing memory address is:

    ``wavefront-scratch-base +
    ((private-address / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.
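
  As a worked illustration of the mapping above, the following Python sketch
  evaluates the formula directly. The function and parameter names are chosen
  for this example only, and the division is taken to be integer division.

```python
def private_to_backing_address(wavefront_scratch_base: int,
                               private_address: int,
                               wavefront_lane_id: int,
                               wavefront_size: int) -> int:
    """Map a per-lane private address to its backing memory address."""
    # Split the private address into a dword index and a byte within the dword.
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4   # one row of dwords per index
            + wavefront_lane_id * 4              # this lane's dword in the row
            + byte_in_dword)


# Lanes 0..3 of a 64-wide wavefront reading private address 0 touch
# consecutive dwords of backing memory.
adjacent = [private_to_backing_address(0x1000, 0, lane, 64) for lane in range(4)]
```

  Here ``adjacent`` holds four addresses spaced 4 bytes apart, which is the
  adjacent-dword access pattern described above.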
729 730 There are different ways that the wavefront scratch base address is 731 determined by a wavefront (see 732 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 733 734 Scratch memory can be accessed in an interleaved manner using buffer 735 instructions with the scratch buffer descriptor and per wavefront scratch 736 offset, by the scratch instructions, or by flat instructions. Multi-dword 737 access is not supported except by flat and scratch instructions in 738 GFX9-GFX10. 739 740**Constant 32-bit** 741 *TODO* 742 743**Buffer Fat Pointer** 744 The buffer fat pointer is an experimental address space that is currently 745 unsupported in the backend. It exposes a non-integral pointer that is in 746 the future intended to support the modelling of 128-bit buffer descriptors 747 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit 748 *pointer*), allowing normal LLVM load/store/atomic operations to be used to 749 model the buffer descriptors used heavily in graphics workloads targeting 750 the backend. 751 752.. _amdgpu-memory-scopes: 753 754Memory Scopes 755------------- 756 757This section provides LLVM memory synchronization scopes supported by the AMDGPU 758backend memory model when the target triple OS is ``amdhsa`` (see 759:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). 760 761The memory model supported is based on the HSA memory model [HSA]_ which is 762based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before 763relation is transitive over the synchronizes-with relation independent of scope 764and synchronizes-with allows the memory scope instances to be inclusive (see 765table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). 766 767This is different to the OpenCL [OpenCL]_ memory model which does not have scope 768inclusion and requires the memory scopes to exactly match. However, this 769is conservatively correct for OpenCL. 770 771 .. 
table:: AMDHSA LLVM Sync Scopes 772 :name: amdgpu-amdhsa-llvm-sync-scopes-table 773 774 ======================= =================================================== 775 LLVM Sync Scope Description 776 ======================= =================================================== 777 *none* The default: ``system``. 778 779 Synchronizes with, and participates in modification 780 and seq_cst total orderings with, other operations 781 (except image operations) for all address spaces 782 (except private, or generic that accesses private) 783 provided the other operation's sync scope is: 784 785 - ``system``. 786 - ``agent`` and executed by a thread on the same 787 agent. 788 - ``workgroup`` and executed by a thread in the 789 same work-group. 790 - ``wavefront`` and executed by a thread in the 791 same wavefront. 792 793 ``agent`` Synchronizes with, and participates in modification 794 and seq_cst total orderings with, other operations 795 (except image operations) for all address spaces 796 (except private, or generic that accesses private) 797 provided the other operation's sync scope is: 798 799 - ``system`` or ``agent`` and executed by a thread 800 on the same agent. 801 - ``workgroup`` and executed by a thread in the 802 same work-group. 803 - ``wavefront`` and executed by a thread in the 804 same wavefront. 805 806 ``workgroup`` Synchronizes with, and participates in modification 807 and seq_cst total orderings with, other operations 808 (except image operations) for all address spaces 809 (except private, or generic that accesses private) 810 provided the other operation's sync scope is: 811 812 - ``system``, ``agent`` or ``workgroup`` and 813 executed by a thread in the same work-group. 814 - ``wavefront`` and executed by a thread in the 815 same wavefront. 

     ``wavefront``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent``, ``workgroup`` or
                               ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``singlethread``        Only synchronizes with, and participates in
                             modification and seq_cst total orderings with,
                             other operations (except image operations) running
                             in the same thread for all address spaces (for
                             example, in signal handlers).

     ``one-as``              Same as ``system`` but only synchronizes with other
                             operations within the same address space.

     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
                             operations within the same address space.

     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
                             other operations within the same address space.

     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
                             other operations within the same address space.

     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
                             other operations within the same address space.
     ======================= ===================================================

LLVM IR Intrinsics
------------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

LLVM IR Attributes
------------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` Clang attribute [CLANG-ATTR]_.
                                             The implied default value is 1,1024.

     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` Clang attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` Clang attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             Clang attribute [CLANG-ATTR]_. This is an optimization hint,
                                             and the backend may not be able to satisfy the request. If
                                             the specified range is incompatible with the function's
                                             "amdgpu-flat-work-group-size" value, the occupancy bounds
                                             implied by the workgroup size take precedence.

     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default for
                                             the calling convention.
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
                                             the mode register to be set on entry. Overrides the default
                                             for the calling convention.

     "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
                                             llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
                                             attribute, or reached through a call site marked with this attribute,
                                             the value returned by the intrinsic is undefined. The backend can
                                             generally infer this during code generation, so typically there is no
                                             benefit to frontends marking functions with this.

     "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workitem.id.y intrinsic.

     "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workitem.id.z intrinsic.

     "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.x intrinsic.

     "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.y intrinsic.

     "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.workgroup.id.z intrinsic.

     "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.dispatch.ptr intrinsic.

     "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.implicitarg.ptr intrinsic.

     "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.dispatch.id intrinsic.

     "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
                                             llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
                                             attributes, the queue pointer may be required in situations where the
                                             intrinsic call does not directly appear in the program. Some subtargets
                                             require the queue pointer to handle some addrspacecasts, as well
                                             as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
                                             llvm.debug intrinsics.

     "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
                                             kernel argument that holds the pointer to the hostcall buffer. If this
                                             attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.

     "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
                                             kernel argument that holds the pointer to an initialized memory buffer
                                             that conforms to the requirements of the malloc/free device library V1
                                             version implementation. If this attribute is absent, then the
                                             amdgpu-no-implicitarg-ptr is also removed.

     ======================================= ==========================================================

.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_HSA_V5``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    process address space applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
    runtime ABI for code object V2. Specify using the Clang option
    ``-mcode-object-version=2``.

  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
    runtime ABI for code object V3. Specify using the Clang option
    ``-mcode-object-version=3``.

  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
    runtime ABI for code object V4. Specify using the Clang option
    ``-mcode-object-version=4``. This is the default code object
    version if not specified.

  * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
    runtime ABI for code object V5. Specify using the Clang option
    ``-mcode-object-version=5``.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:


  ``ET_REL``
    The type produced by the AMDGPU backend compiler because it is a
    relocatable code object.

  ``ET_DYN``
    The type produced by the linker because it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
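
As an illustration of the identification fields described so far, the following
Python sketch classifies the first 20 bytes of a code object. It is not part of
any AMD or LLVM tool: the function name ``describe_amdgpu_elf_header`` and the
keys of the returned dictionary are invented for this example. The numeric
values are taken from :ref:`amdgpu-elf-header-enumeration-values-table`, and the
byte offsets are the standard ELF header offsets (see [ELF]_).

.. code:: python

  import struct

  # Values from the "AMDGPU ELF Header Enumeration Values" table.
  EM_AMDGPU = 224
  OS_BY_OSABI = {0: "unknown", 64: "amdhsa", 65: "amdpal", 66: "mesa3d"}
  HSA_CODE_OBJECT_BY_ABIVERSION = {0: 2, 1: 3, 2: 4, 3: 5}

  # Standard ELF constants (not AMDGPU specific).
  ARCH_BY_CLASS = {1: "r600", 2: "amdgcn"}  # ELFCLASS32, ELFCLASS64
  ELFDATA2LSB = 1
  TYPE_NAMES = {1: "ET_REL", 3: "ET_DYN"}


  def describe_amdgpu_elf_header(header: bytes) -> dict:
      """Hypothetical helper: summarize the start of an AMDGPU ELF header."""
      if len(header) < 20 or header[:4] != b"\x7fELF":
          raise ValueError("not an ELF header")
      ei_class, ei_data = header[4], header[5]
      ei_osabi, ei_abiversion = header[7], header[8]
      # e_type and e_machine are the two 16-bit fields that follow e_ident.
      e_type, e_machine = struct.unpack_from("<HH", header, 16)
      if ei_data != ELFDATA2LSB or e_machine != EM_AMDGPU:
          raise ValueError("not an AMDGPU code object")
      os_name = OS_BY_OSABI.get(ei_osabi)
      # EI_ABIVERSION maps to a code object version only for the amdhsa OS.
      version = None
      if os_name == "amdhsa":
          version = HSA_CODE_OBJECT_BY_ABIVERSION.get(ei_abiversion)
      return {
          "architecture": ARCH_BY_CLASS.get(ei_class),
          "os": os_name,
          "code_object_version": version,
          "type": TYPE_NAMES.get(e_type),
      }

Unrecognized values are reported as ``None`` rather than rejected, because the
tables above may not list every value that a given toolchain emits.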

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
  ``e_flags`` for code object V3 and above (see
  :ref:`amdgpu-elf-header-e_flags-table-v3` and
  :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
     :name: amdgpu-elf-header-e_flags-v2-table

     ===================================== ===== =============================
     Name                                  Value Description
     ===================================== ===== =============================
     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
                                                 target feature is
                                                 enabled for all code
                                                 contained in the code object.
                                                 If the processor
                                                 does not support the
                                                 ``xnack`` target
                                                 feature then it must
                                                 be 0.
                                                 See
                                                 :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
                                                 handler is enabled for all
                                                 code contained in the code
                                                 object. If the processor
                                                 does not support a trap
                                                 handler then it must be 0.
                                                 See
                                                 :ref:`amdgpu-target-features`.
     ===================================== ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
     :name: amdgpu-elf-header-e_flags-table-v3

     ================================= ===== =============================
     Name                              Value Description
     ================================= ===== =============================
     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
                                             mask for
                                             ``EF_AMDGPU_MACH_xxx`` values
                                             defined in
                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
                                             target feature is
                                             enabled for all code
                                             contained in the code object.
                                             If the processor
                                             does not support the
                                             ``xnack`` target
                                             feature then it must
                                             be 0.
                                             See
                                             :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
                                             target feature is
                                             enabled for all code
                                             contained in the code object.
                                             If the processor
                                             does not support the
                                             ``sramecc`` target
                                             feature then it must
                                             be 0.
                                             See
                                             :ref:`amdgpu-target-features`.
     ================================= ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
     :name: amdgpu-elf-header-e_flags-table-v4-onwards

     ============================================ ===== ===================================
     Name                                         Value Description
     ============================================ ===== ===================================
     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
                                                        mask for
                                                        ``EF_AMDGPU_MACH_xxx`` values
                                                        defined in
                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
     ============================================ ===== ===================================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ==================================== ========== =============================
     Name                                 Value      Description (see
                                                     :ref:`amdgpu-processor-table`)
     ==================================== ========== =============================
     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
     *reserved*                           0x011 -    Reserved for ``r600``
                                          0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
     *reserved*                           0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
     ``EF_AMDGPU_MACH_AMDGCN_GFX940``     0x040      ``gfx940``
     *reserved*                           0x041      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
     *reserved*                           0x043      Reserved.
     *reserved*                           0x044      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1036``    0x045      ``gfx1036``
     ==================================== ========== =============================

Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call. Generated
  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3-onwards`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
alignment.

.. _amdgpu-note-records-v2:

Code Object V2 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU backend code object uses the following ELF note records in the
``.note`` section when compiling for code object V2.

The note record vendor field is "AMD".
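
The padding rules given under :ref:`amdgpu-note-records` can be sketched in
Python as follows. This is only an illustration under stated assumptions: the
helper names ``align4`` and ``build_note`` are invented here, and the record
header of three 32-bit words (``namesz``, ``descsz``, ``type``) is the generic
ELF note layout (see [ELF]_), not something specific to the AMDGPU backend.

.. code:: python

  import struct


  def align4(size: int) -> int:
      """Round size up to the next multiple of 4."""
      return (size + 3) & ~3


  def build_note(name: bytes, note_type: int, desc: bytes) -> bytes:
      """Hypothetical helper: encode one little-endian ELF note record."""
      name_z = name + b"\0"  # namesz counts the terminating NUL
      header = struct.pack("<III", len(name_z), len(desc), note_type)
      # Minimal zero padding so that desc starts 4 byte aligned ...
      padded_name = name_z.ljust(align4(len(name_z)), b"\0")
      # ... and so that the space occupied by desc is a multiple of 4 bytes.
      padded_desc = desc.ljust(align4(len(desc)), b"\0")
      return header + padded_name + padded_desc

With the vendor field "AMD", the name occupies exactly 4 bytes including its
NUL terminator and needs no padding, whereas a 7 byte name such as "AMDGPU"
plus its NUL terminator is padded by one zero byte.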

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V2 ELF Note Records
     :name: amdgpu-elf-note-records-v2-table

     ===== ===================================== ======================================
     Name  Type                                  Description
     ===== ===================================== ======================================
     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
                                                 Finalizer and not the LLVM compiler.
     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
                                                 YAML [YAML]_ textual format.
     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
     ===== ===================================== ======================================

..

  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-v2-table

     ===================================== =====
     Name                                  Value
     ===================================== =====
     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
     ``NT_AMD_HSA_HSAIL``                  2
     ``NT_AMD_HSA_ISA_VERSION``            3
     *reserved*                            4-9
     ``NT_AMD_HSA_METADATA``               10
     ``NT_AMD_HSA_ISA_NAME``               11
     ===================================== =====

``NT_AMD_HSA_CODE_OBJECT_VERSION``
  Specifies the code object version number. The description field has the
  following layout:

  .. code:: c

    struct amdgpu_hsa_note_code_object_version_s {
      uint32_t major_version;
      uint32_t minor_version;
    };

  The ``major_version`` has a value less than or equal to 2.

``NT_AMD_HSA_HSAIL``
  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
  field has the following layout:

  .. code:: c

    struct amdgpu_hsa_note_hsail_s {
      uint32_t hsail_major_version;
      uint32_t hsail_minor_version;
      uint8_t profile;
      uint8_t machine_model;
      uint8_t default_float_round;
    };

``NT_AMD_HSA_ISA_VERSION``
  Specifies the target ISA version. The description field has the following layout:

  .. code:: c

    struct amdgpu_hsa_note_isa_s {
      uint16_t vendor_name_size;
      uint16_t architecture_name_size;
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
      char vendor_and_architecture_name[1];
    };

  ``vendor_name_size`` and ``architecture_name_size`` are the length of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.

  This note record is used by the HSA runtime loader.

  Code object V2 only supports a limited number of processors and has fixed
  settings for target features. See
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
  processors and the corresponding target ID. In the table the note record ISA
  name is a concatenation of the vendor name, architecture name, major, minor,
  and stepping separated by a ":".

  The target ID column shows the processor name and fixed target features used
  by the LLVM compiler. The LLVM compiler does not generate a
  ``NT_AMD_HSA_HSAIL`` note record.

  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
  bit.
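
The ``amdgpu_hsa_note_isa_s`` layout and the ":" separated ISA name described
above can be tied together with a short Python sketch. The function name
``isa_name_from_desc`` is invented for this illustration, and the sketch
assumes a little-endian description field with no padding between the struct
members, which matches the natural alignment of the declaration shown above.

.. code:: python

  import struct


  def isa_name_from_desc(desc: bytes) -> str:
      """Hypothetical helper: format an NT_AMD_HSA_ISA_VERSION description."""
      # Two uint16_t sizes followed by three uint32_t version numbers.
      vendor_size, arch_size, major, minor, stepping = struct.unpack_from(
          "<HHIII", desc, 0)
      names = desc[16:16 + vendor_size + arch_size]
      # Both sizes include the NUL terminator, so strip it before decoding.
      vendor = names[:vendor_size].rstrip(b"\0").decode("ascii")
      arch = names[vendor_size:].rstrip(b"\0").decode("ascii")
      return f"{vendor}:{arch}:{major}:{minor}:{stepping}"

For a description holding the names "AMD" and "AMDGPU" with version 9.0.6, the
sketch yields ``AMD:AMDGPU:9:0:6``, the form used in the first column of
:ref:`amdgpu-elf-note-record-supported_processors-v2-table`.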
1440 1441``NT_AMD_HSA_ISA_NAME`` 1442 Specifies the target ISA name as a non-NUL terminated string. 1443 1444 This note record is not used by the HSA runtime loader. 1445 1446 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object 1447 V2's limited support of processors and fixed settings for target features. 1448 1449 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping 1450 from the string to the corresponding target ID. If the ``xnack`` target 1451 feature is supported and enabled, the string produced by the LLVM compiler 1452 will may have a ``+xnack`` appended. The Finlizer did not do the appending and 1453 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit. 1454 1455``NT_AMD_HSA_METADATA`` 1456 Specifies extensible metadata associated with the code objects executed on HSA 1457 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the 1458 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See 1459 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object 1460 metadata string. 1461 1462 .. 
table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings 1463 :name: amdgpu-elf-note-record-supported_processors-v2-table 1464 1465 ===================== ========================== 1466 Note Record ISA Name Target ID 1467 ===================== ========================== 1468 ``AMD:AMDGPU:6:0:0`` ``gfx600`` 1469 ``AMD:AMDGPU:6:0:1`` ``gfx601`` 1470 ``AMD:AMDGPU:6:0:2`` ``gfx602`` 1471 ``AMD:AMDGPU:7:0:0`` ``gfx700`` 1472 ``AMD:AMDGPU:7:0:1`` ``gfx701`` 1473 ``AMD:AMDGPU:7:0:2`` ``gfx702`` 1474 ``AMD:AMDGPU:7:0:3`` ``gfx703`` 1475 ``AMD:AMDGPU:7:0:4`` ``gfx704`` 1476 ``AMD:AMDGPU:7:0:5`` ``gfx705`` 1477 ``AMD:AMDGPU:8:0:0`` ``gfx802`` 1478 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+`` 1479 ``AMD:AMDGPU:8:0:2`` ``gfx802`` 1480 ``AMD:AMDGPU:8:0:3`` ``gfx803`` 1481 ``AMD:AMDGPU:8:0:4`` ``gfx803`` 1482 ``AMD:AMDGPU:8:0:5`` ``gfx805`` 1483 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+`` 1484 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-`` 1485 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+`` 1486 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-`` 1487 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+`` 1488 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-`` 1489 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+`` 1490 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-`` 1491 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+`` 1492 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-`` 1493 ===================== ========================== 1494 1495.. _amdgpu-note-records-v3-onwards: 1496 1497Code Object V3 and Above Note Records 1498~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1499 1500The AMDGPU backend code object uses the following ELF note record in the 1501``.note`` section when compiling for code object V3 and above. 1502 1503The note record vendor field is "AMDGPU". 1504 1505Additional note records may be present, but any which are not documented here 1506are deprecated and should not be used. 1507 1508 .. 
table:: AMDGPU Code Object V3 and Above ELF Note Records 1509 :name: amdgpu-elf-note-records-table-v3-onwards 1510 1511 ======== ============================== ====================================== 1512 Name Type Description 1513 ======== ============================== ====================================== 1514 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 1515 binary format. 1516 ======== ============================== ====================================== 1517 1518.. 1519 1520 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values 1521 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards 1522 1523 ============================== ===== 1524 Name Value 1525 ============================== ===== 1526 *reserved* 0-31 1527 ``NT_AMDGPU_METADATA`` 32 1528 ============================== ===== 1529 1530``NT_AMDGPU_METADATA`` 1531 Specifies extensible metadata associated with an AMDGPU code object. It is 1532 encoded as a map in the Message Pack [MsgPack]_ binary data format. See 1533 :ref:`amdgpu-amdhsa-code-object-metadata-v3`, 1534 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and 1535 :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the 1536 ``amdhsa`` OS. 1537 1538.. _amdgpu-symbols: 1539 1540Symbols 1541------- 1542 1543Symbols include the following: 1544 1545 .. 
table:: AMDGPU ELF Symbols 1546 :name: amdgpu-elf-symbols-table 1547 1548 ===================== ================== ================ ================== 1549 Name Type Section Description 1550 ===================== ================== ================ ================== 1551 *link-name* ``STT_OBJECT`` - ``.data`` Global variable 1552 - ``.rodata`` 1553 - ``.bss`` 1554 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor 1555 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point 1556 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS 1557 ===================== ================== ================ ================== 1558 1559Global variable 1560 Global variables both used and defined by the compilation unit. 1561 1562 If the symbol is defined in the compilation unit then it is allocated in the 1563 appropriate section according to if it has initialized data or is readonly. 1564 1565 If the symbol is external then its section is ``STN_UNDEF`` and the loader 1566 will resolve relocations using the definition provided by another code object 1567 or explicitly defined by the runtime. 1568 1569 If the symbol resides in local/group memory (LDS) then its section is the 1570 special processor specific section name ``SHN_AMDGPU_LDS``, and the 1571 ``st_value`` field describes alignment requirements as it does for common 1572 symbols. 1573 1574 .. TODO:: 1575 1576 Add description of linked shared object symbols. Seems undefined symbols 1577 are marked as STT_NOTYPE. 1578 1579Kernel descriptor 1580 Every HSA kernel has an associated kernel descriptor. It is the address of the 1581 kernel descriptor that is used in the AQL dispatch packet used to invoke the 1582 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is 1583 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. 1584 1585Kernel entry point 1586 Every HSA kernel also has a symbol for its machine code entry point. 1587 1588.. 
_amdgpu-relocation-records: 1589 1590Relocation Records 1591------------------ 1592 1593AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported 1594relocatable fields are: 1595 1596``word32`` 1597 This specifies a 32-bit field occupying 4 bytes with arbitrary byte 1598 alignment. These values use the same byte order as other word values in the 1599 AMDGPU architecture. 1600 1601``word64`` 1602 This specifies a 64-bit field occupying 8 bytes with arbitrary byte 1603 alignment. These values use the same byte order as other word values in the 1604 AMDGPU architecture. 1605 1606Following notations are used for specifying relocation calculations: 1607 1608**A** 1609 Represents the addend used to compute the value of the relocatable field. 1610 1611**G** 1612 Represents the offset into the global offset table at which the relocation 1613 entry's symbol will reside during execution. 1614 1615**GOT** 1616 Represents the address of the global offset table. 1617 1618**P** 1619 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) 1620 of the storage unit being relocated (computed using ``r_offset``). 1621 1622**S** 1623 Represents the value of the symbol whose index resides in the relocation 1624 entry. Relocations not using this must specify a symbol index of 1625 ``STN_UNDEF``. 1626 1627**B** 1628 Represents the base address of a loaded executable or shared object which is 1629 the difference between the ELF address and the actual load address. 1630 Relocations using this are only valid in executable or shared objects. 1631 1632The following relocation types are supported: 1633 1634 .. 
table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     *none*     *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ``R_AMDGPU_REL16``         Static  14    ``word16`` ((S + A - P) - 4) / 4
     ========================== ======= ===== ========== ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.

.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:

Loaded Code Object Path Uniform Resource Identifier (URI)
---------------------------------------------------------

The AMD GPU code object loader represents the path of the ELF shared object from
which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded, relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
  and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%". Directories in
  the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000

..
_amdgpu-dwarf-debug-information:

DWARF Debug Information
=======================

.. warning::

   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
   is not currently fully implemented and is subject to change.

AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
:ref:`amdgpu-elf-code-object`) which contain information that maps the code
object executable code and data to the source language constructs. It can be
used by tools such as debuggers and profilers. It uses features defined in
:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.

This section defines the AMDGPU target architecture specific DWARF mappings.

.. _amdgpu-dwarf-register-identifier:

Register Identifier
-------------------

This section defines the AMDGPU target architecture register numbers used in
DWARF operation expressions (see DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
instructions (see DWARF Version 5 section 6.4 and
:ref:`amdgpu-dwarf-call-frame-information`).

A single code object can contain code for kernels that have different wavefront
sizes. The vector registers and some scalar registers are based on the wavefront
size. AMDGPU defines distinct DWARF registers for each wavefront size. This
simplifies the consumer of the DWARF so that each register has a fixed size,
rather than being dynamic according to the wavefront size mode. Similarly,
distinct DWARF registers are defined for those registers that vary in size
according to the process address size. This allows a consumer to treat a
specific AMDGPU processor as a single architecture regardless of how it is
configured at run time.
The compiler explicitly specifies the DWARF registers
that match the mode in which the code it is generating will be executed.

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================

The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32-bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
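
The register numbering and lane layout described above can be illustrated with
a small lookup sketch. This is not part of LLVM or of any debugger API: the
function names are hypothetical, and the numeric ranges are simply transcribed
from the AMDGPU DWARF Register Mapping table.

```python
# Illustrative sketch only: helper names are hypothetical, and the ranges are
# transcribed from the AMDGPU DWARF Register Mapping table.

# Base DWARF register number for each vector register file and wavefront size.
_VECTOR_BASE = {
    ("VGPR", 32): 1536,
    ("AGPR", 32): 2048,
    ("VGPR", 64): 2560,
    ("AGPR", 64): 3072,
}


def sgpr_dwarf_register(index):
    """Return the DWARF register number for SGPR<index>."""
    if 0 <= index <= 63:
        return 32 + index            # SGPR0-SGPR63 map to 32-95
    if 64 <= index <= 105:
        return 1088 + (index - 64)   # SGPR64-SGPR105 map to 1088-1129
    raise ValueError(f"SGPR{index} has no DWARF register number")


def vector_dwarf_register(kind, index, wavefront_size):
    """Return the DWARF register number for VGPR<index> or AGPR<index>."""
    if not 0 <= index <= 255:
        raise ValueError(f"{kind}{index} has no DWARF register number")
    if (kind, wavefront_size) not in _VECTOR_BASE:
        raise ValueError(f"unsupported combination: {kind}, wavefront {wavefront_size}")
    return _VECTOR_BASE[(kind, wavefront_size)] + index


def lane_bit_range(lane, wavefront_size):
    """Return (first bit, bit size) of one lane's dword in a vector register.

    Lane 0 occupies the least significant dword, lane 1 the next, and so on.
    """
    if not 0 <= lane < wavefront_size:
        raise ValueError(f"lane {lane} is outside a wavefront of {wavefront_size}")
    return lane * 32, 32
```

For instance, under these assumptions ``vector_dwarf_register("VGPR", 3, 64)``
selects a register from the wavefront 64 VGPR range, and
``lane_bit_range(5, 64)`` gives the bit position of lane 5 within that 64*32-bit
register.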

If the wavefront size is 32 lanes then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 mode. The register definitions corresponding
to the wavefront mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is generated
to execute in a 64-bit process address space, then the 64-bit process address
space register definitions are used. The ``amdgcn`` target only supports the
64-bit process address space.

.. _amdgpu-dwarf-address-class-identifier:

Address Class Identifier
------------------------

The DWARF address class represents the source language memory space. See DWARF
Version 5 section 2.12 which is updated by the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address class mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.

..
table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

The DWARF address class values defined in the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.

In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is
available for use for the AMD extension for access to the hardware GDS memory
which is scratchpad memory allocated per device.

For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
address class of ``DW_ADDR_none`` is used.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address size
and NULL value.

.. _amdgpu-dwarf-address-space-identifier:

Address Space Identifier
------------------------

DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address space mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-space-mapping-table`.

..
table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                                          AMDGPU            Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
   ======================================= ===== ======= ======== ================= =======================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
including address size and NULL value.

The ``DW_ASPACE_none`` address space is the default target architecture address
space used in DWARF operations that do not specify an address space. It
therefore has to map to the global address space so that the ``DW_OP_addr*`` and
related operations can refer to addresses in the program code.

The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
specify the flat address space. If the address corresponds to an address in the
local address space, then it corresponds to the wavefront that is executing the
focused thread of execution.
If the address corresponds to an address in the
private address space, then it corresponds to the lane that is executing the
focused thread of execution for languages that are implemented using a SIMD or
SIMT execution model.

.. note::

   CUDA-like languages such as HIP that do not have address spaces in the
   language type system, but do allow variables to be allocated in different
   address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
   address space in the DWARF expression operations as the default address space
   is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is executing
the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
to specify the private address space corresponding to the lane that is executing
the focused thread of execution for languages that are implemented using a SIMD
or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
to specify the unswizzled private address space corresponding to the wavefront
that is executing the focused thread of execution. The wavefront view of private
memory is the per wavefront unswizzled backing memory layout defined in
:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
location for the backing memory of the wavefront (namely the address is not
offset by ``wavefront-scratch-base``).
The following formula can be used to
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, and on which a source language maps its
threads of execution onto those lanes. The DWARF lane identifier is pushed by
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.

For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage which includes memory
and registers. When accessing storage on AMDGPU, bytes are ordered with least
significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
unwinding vector registers that are spilled under the execution mask to memory:
the zero-single location description is the vector register, and the one-single
location description is the spilled memory location description. The
``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
which are updated by *DWARF Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-low-level-information` and
:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.

..
_amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
value it had at the beginning of the region. This is shown below. Other
approaches are possible, but the basic concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
      b;
      %3 = EXEC
      %4 = c2
  $lex_1_1_start:
      EXEC = %3 & %4
  $lex_1_1_then:
        c;
      EXEC = ~EXEC & %3
  $lex_1_1_else:
        d;
      EXEC = %3
  $lex_1_1_end:
      e;
    EXEC = ~EXEC & %1
  $lex_1_else:
      f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This can
be done by defining an artificial variable for the lane PC. The DWARF location
list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.

A DWARF procedure is defined for each well nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
conceptually inherits the value of the immediately enclosing region and modifies
it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
the region for the ``THEN`` region since it is executed first.
For the ``ELSE``
region the divergent program location is at the end of the ``IF/THEN/ELSE``
region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the result
by inserting the current program location for each lane that the ``EXEC`` mask
indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the DWARF
operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_then";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_start;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_call_ref %__active_lane_pc;
      ];
      b;
      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
      %4 = c2;
  $lex_1_1_start:
      EXEC = %3 & %4;
  $lex_1_1_then:
        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_then";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_start;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
          DW_OP_call_ref %__active_lane_pc;
        ];
        c;
      EXEC = ~EXEC & %3;
  $lex_1_1_else:
        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_else";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_end;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
          DW_OP_call_ref %__active_lane_pc;
        ];
        d;
      EXEC = %3;
  $lex_1_1_end:
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_call_ref %__active_lane_pc;
      ];
      e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
      DEFINE_DWARF
%__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_else";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_end;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_else;
        DW_OP_call_ref %__active_lane_pc;
      ];
      f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
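
The way the region procedures compose can be mimicked outside of DWARF. The
following is a minimal simulation, not how LLVM or a debugger implements it:
the helper names, the 4-lane wavefront, and all addresses are made up for
illustration, and ``None`` stands in for the undefined location. The per-lane
selection plays the role of ``DW_OP_LLVM_select_bit_piece``.

```python
# Minimal simulation of the lane PC computation for one IF/THEN/ELSE region.
# Helper names, the 4-lane wavefront and every address are hypothetical.

UNDEFINED = None  # stands in for the undefined location description


def select_lanes(mask, one, zero, lanes):
    """Per lane, pick one[i] if bit i of mask is set, otherwise zero[i]."""
    return [one[i] if (mask >> i) & 1 else zero[i] for i in range(lanes)]


def divergent_region_pc(parent, region_addr, saved_exec, lanes):
    """Lanes active on entry to the region get region_addr; others inherit parent."""
    return select_lanes(saved_exec, [region_addr] * lanes, parent, lanes)


def lane_pc(divergent, current_pc, exec_mask, lanes):
    """Overlay the current PC on the lanes that EXEC marks as active."""
    return select_lanes(exec_mask, [current_pc] * lanes, divergent, lanes)


LANES = 4
LEX_1_START, LEX_1_END = 0x100, 0x200   # made-up region start and end addresses
PC_B, PC_F = 0x120, 0x180               # made-up addresses of statements b and f

subprogram_divergent = [UNDEFINED] * LANES
saved_exec = 0b0111                     # lanes 0-2 active on entry to the region
c1 = 0b0011                             # condition true for lanes 0 and 1

# THEN region: divergent lanes are conceptually at the start of the region.
exec_then = saved_exec & c1
then_div = divergent_region_pc(subprogram_divergent, LEX_1_START, saved_exec, LANES)
then_pcs = lane_pc(then_div, PC_B, exec_then, LANES)

# ELSE region: divergent lanes are conceptually at the end of the region.
exec_else = ~exec_then & saved_exec
else_div = divergent_region_pc(subprogram_divergent, LEX_1_END, saved_exec, LANES)
else_pcs = lane_pc(else_div, PC_F, exec_else, LANES)
```

In this sketch lane 3 was never active, so it keeps the undefined location in
both regions, while lane 2 sits at the region start during ``THEN`` and becomes
active in ``ELSE``.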

An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
update it to enable the necessary lanes, perform the operations, and then
restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit.
The version number conforms to [SEMVER]_.

Call Frame Information
----------------------

DWARF Call Frame Information (CFI) describes how a consumer can virtually
*unwind* call frames in a running process or core dump. See DWARF Version 5
section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.

For AMDGPU, the Common Information Entry (CIE) fields have the following values:

1. ``augmentation`` string contains the following null-terminated UTF-8 string:

   ::

     [amd:v0.0]

   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE or in the FDEs that use it. The version number
   conforms to [SEMVER]_.

2. ``address_size`` for the ``Global`` address space is defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.

4. ``code_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

5. ``data_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
   for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.

7. ``initial_instructions``: since a subprogram X with fewer registers can be
   called from subprogram Y that has more allocated, X will not change any of
   the extra registers as it cannot access them. Therefore, the default rule
   for all columns is ``same value``.

For AMDGPU the register number follows the numbering defined in
:ref:`amdgpu-dwarf-register-identifier`.

For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.
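
The reason for subtracting 1 is that a return address can be the first byte
after the call instruction, which may already belong to a different address
range. The following Python sketch illustrates the lookup a consumer might
perform. The range table ``FDE_RANGES``, its contents, and the function name
are hypothetical stand-ins for FDE address ranges, not an LLVM or debugger API.

```python
import bisect

# Hypothetical FDE address ranges: (start, end_exclusive, name), sorted by start.
FDE_RANGES = [
    (0x1000, 0x1040, "kernel_a"),
    (0x1040, 0x10A0, "helper_b"),
]


def find_fde_for_return_address(return_address, ranges=FDE_RANGES):
    """Return the name of the range holding the call site, or None."""
    lookup = return_address - 1  # a byte inside the call site instructions
    starts = [start for start, _, _ in ranges]
    index = bisect.bisect_right(starts, lookup) - 1
    if index < 0:
        return None
    start, end, name = ranges[index]
    return name if start <= lookup < end else None


# A call that is the last instruction of kernel_a returns to 0x1040, which is
# the first byte of helper_b. Subtracting 1 attributes it to kernel_a.
print(find_fde_for_return_address(0x1040))
```

Looking up the unadjusted address ``0x1040`` in the same table would select
``helper_b``, which is why the adjustment matters for calls at the end of a
range.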

Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU, the lookup by name section header table fields have the following
values:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This differs from the DWARF Version 5 definition, which requires the first
    4 characters to be the vendor ID. However, it is consistent with the other
    augmentation strings and allows multiple vendor contributions. Backwards
    compatibility may nevertheless be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU, the lookup by address section header table fields have the
following values:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

  Should the ``isa`` state machine register be used to indicate if the code is
  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
  the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
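
Since the key names differ between versions, a consumer usually checks for the
keys it depends on before using a metadata map. The Python sketch below is a
rough illustration for V3-style metadata. It assumes the Message Pack data has
already been decoded into ordinary dictionaries by some library (not shown), it
takes the required keys from the rows marked *Required* in the V3 tables later
in this section, and the function name and sample values are hypothetical.

```python
# Keys marked "Required" in the V3 metadata map and kernel metadata map tables.
REQUIRED_TOP_LEVEL = ("amdhsa.version", "amdhsa.kernels")
REQUIRED_KERNEL = (
    ".name",
    ".symbol",
    ".kernarg_segment_size",
    ".group_segment_fixed_size",
    ".private_segment_fixed_size",
    ".kernarg_segment_align",
    ".wavefront_size",
    ".sgpr_count",
    ".vgpr_count",
    ".agpr_count",
    ".max_flat_workgroup_size",
)


def missing_required_keys(metadata):
    """Return descriptions of required keys absent from a decoded map."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
    for index, kernel in enumerate(metadata.get("amdhsa.kernels", [])):
        missing += [
            f"amdhsa.kernels[{index}]{key}"
            for key in REQUIRED_KERNEL
            if key not in kernel
        ]
    return missing


# Made-up sample values; ".agpr_count" is deliberately left out.
sample = {
    "amdhsa.version": [1, 0],
    "amdhsa.kernels": [
        {
            ".name": "vector_add",
            ".symbol": "vector_add.kd",
            ".kernarg_segment_size": 24,
            ".group_segment_fixed_size": 0,
            ".private_segment_fixed_size": 0,
            ".kernarg_segment_align": 8,
            ".wavefront_size": 64,
            ".sgpr_count": 16,
            ".vgpr_count": 8,
            ".max_flat_workgroup_size": 256,
        }
    ],
}

print(missing_required_keys(sample))
```

A real consumer would also validate value types and the version numbers; this
sketch only checks for the presence of keys.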
2585 2586Code object metadata is specified in a note record (see 2587:ref:`amdgpu-note-records`) and is required when the target triple OS is 2588``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum 2589information necessary to support the HSA compatible runtime kernel queries. For 2590example, the segment sizes needed in a dispatch packet. In addition, a 2591high-level language runtime may require other information to be included. For 2592example, the AMD OpenCL runtime records kernel argument information. 2593 2594.. _amdgpu-amdhsa-code-object-metadata-v2: 2595 2596Code Object V2 Metadata 2597+++++++++++++++++++++++ 2598 2599.. warning:: 2600 Code object V2 is not the default code object version emitted by this version 2601 of LLVM. 2602 2603Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record 2604(see :ref:`amdgpu-note-records-v2`). 2605 2606The metadata is specified as a YAML formatted string (see [YAML]_ and 2607:doc:`YamlIO`). 2608 2609.. TODO:: 2610 2611 Is the string null terminated? It probably should not if YAML allows it to 2612 contain null characters, otherwise it should be. 2613 2614The metadata is represented as a single YAML document comprised of the mapping 2615defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and 2616referenced tables. 2617 2618For boolean values, the string values of ``false`` and ``true`` are used for 2619false and true respectively. 2620 2621Additional information can be added to the mappings. To avoid conflicts, any 2622non-AMD key names should be prefixed by "*vendor-name*.". 2623 2624 .. table:: AMDHSA Code Object V2 Metadata Map 2625 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table 2626 2627 ========== ============== ========= ======================================= 2628 String Key Value Type Required? 
Description 2629 ========== ============== ========= ======================================= 2630 "Version" sequence of Required - The first integer is the major 2631 2 integers version. Currently 1. 2632 - The second integer is the minor 2633 version. Currently 0. 2634 "Printf" sequence of Each string is encoded information 2635 strings about a printf function call. The 2636 encoded information is organized as 2637 fields separated by colon (':'): 2638 2639 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 2640 2641 where: 2642 2643 ``ID`` 2644 A 32-bit integer as a unique id for 2645 each printf function call 2646 2647 ``N`` 2648 A 32-bit integer equal to the number 2649 of arguments of printf function call 2650 minus 1 2651 2652 ``S[i]`` (where i = 0, 1, ... , N-1) 2653 32-bit integers for the size in bytes 2654 of the i-th FormatString argument of 2655 the printf function call 2656 2657 FormatString 2658 The format string passed to the 2659 printf function call. 2660 "Kernels" sequence of Required Sequence of the mappings for each 2661 mapping kernel in the code object. See 2662 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table` 2663 for the definition of the mapping. 2664 ========== ============== ========= ======================================= 2665 2666.. 2667 2668 .. table:: AMDHSA Code Object V2 Kernel Metadata Map 2669 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table 2670 2671 ================= ============== ========= ================================ 2672 String Key Value Type Required? Description 2673 ================= ============== ========= ================================ 2674 "Name" string Required Source name of the kernel. 2675 "SymbolName" string Required Name of the kernel 2676 descriptor ELF symbol. 2677 "Language" string Source language of the kernel. 
2678 Values include: 2679 2680 - "OpenCL C" 2681 - "OpenCL C++" 2682 - "HCC" 2683 - "OpenMP" 2684 2685 "LanguageVersion" sequence of - The first integer is the major 2686 2 integers version. 2687 - The second integer is the 2688 minor version. 2689 "Attrs" mapping Mapping of kernel attributes. 2690 See 2691 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table` 2692 for the mapping definition. 2693 "Args" sequence of Sequence of mappings of the 2694 mapping kernel arguments. See 2695 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table` 2696 for the definition of the mapping. 2697 "CodeProps" mapping Mapping of properties related to 2698 the kernel code. See 2699 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table` 2700 for the mapping definition. 2701 ================= ============== ========= ================================ 2702 2703.. 2704 2705 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map 2706 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table 2707 2708 =================== ============== ========= ============================== 2709 String Key Value Type Required? Description 2710 =================== ============== ========= ============================== 2711 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values 2712 3 integers must be >=1 and the dispatch 2713 work-group size X, Y, Z must 2714 correspond to the specified 2715 values. Defaults to 0, 0, 0. 2716 2717 Corresponds to the OpenCL 2718 ``reqd_work_group_size`` 2719 attribute. 2720 "WorkGroupSizeHint" sequence of The dispatch work-group size 2721 3 integers X, Y, Z is likely to be the 2722 specified values. 2723 2724 Corresponds to the OpenCL 2725 ``work_group_size_hint`` 2726 attribute. 2727 "VecTypeHint" string The name of a scalar or vector 2728 type. 2729 2730 Corresponds to the OpenCL 2731 ``vec_type_hint`` attribute. 
2732 2733 "RuntimeHandle" string The external symbol name 2734 associated with a kernel. 2735 OpenCL runtime allocates a 2736 global buffer for the symbol 2737 and saves the kernel's address 2738 to it, which is used for 2739 device side enqueueing. Only 2740 available for device side 2741 enqueued kernels. 2742 =================== ============== ========= ============================== 2743 2744.. 2745 2746 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map 2747 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table 2748 2749 ================= ============== ========= ================================ 2750 String Key Value Type Required? Description 2751 ================= ============== ========= ================================ 2752 "Name" string Kernel argument name. 2753 "TypeName" string Kernel argument type name. 2754 "Size" integer Required Kernel argument size in bytes. 2755 "Align" integer Required Kernel argument alignment in 2756 bytes. Must be a power of two. 2757 "ValueKind" string Required Kernel argument kind that 2758 specifies how to set up the 2759 corresponding argument. 2760 Values include: 2761 2762 "ByValue" 2763 The argument is copied 2764 directly into the kernarg. 2765 2766 "GlobalBuffer" 2767 A global address space pointer 2768 to the buffer data is passed 2769 in the kernarg. 2770 2771 "DynamicSharedPointer" 2772 A group address space pointer 2773 to dynamically allocated LDS 2774 is passed in the kernarg. 2775 2776 "Sampler" 2777 A global address space 2778 pointer to a S# is passed in 2779 the kernarg. 2780 2781 "Image" 2782 A global address space 2783 pointer to a T# is passed in 2784 the kernarg. 2785 2786 "Pipe" 2787 A global address space pointer 2788 to an OpenCL pipe is passed in 2789 the kernarg. 2790 2791 "Queue" 2792 A global address space pointer 2793 to an OpenCL device enqueue 2794 queue is passed in the 2795 kernarg. 
2796 2797 "HiddenGlobalOffsetX" 2798 The OpenCL grid dispatch 2799 global offset for the X 2800 dimension is passed in the 2801 kernarg. 2802 2803 "HiddenGlobalOffsetY" 2804 The OpenCL grid dispatch 2805 global offset for the Y 2806 dimension is passed in the 2807 kernarg. 2808 2809 "HiddenGlobalOffsetZ" 2810 The OpenCL grid dispatch 2811 global offset for the Z 2812 dimension is passed in the 2813 kernarg. 2814 2815 "HiddenNone" 2816 An argument that is not used 2817 by the kernel. Space needs to 2818 be left for it, but it does 2819 not need to be set up. 2820 2821 "HiddenPrintfBuffer" 2822 A global address space pointer 2823 to the runtime printf buffer 2824 is passed in kernarg. 2825 2826 "HiddenHostcallBuffer" 2827 A global address space pointer 2828 to the runtime hostcall buffer 2829 is passed in kernarg. 2830 2831 "HiddenDefaultQueue" 2832 A global address space pointer 2833 to the OpenCL device enqueue 2834 queue that should be used by 2835 the kernel by default is 2836 passed in the kernarg. 2837 2838 "HiddenCompletionAction" 2839 A global address space pointer 2840 to help link enqueued kernels into 2841 the ancestor tree for determining 2842 when the parent kernel has finished. 2843 2844 "HiddenMultiGridSyncArg" 2845 A global address space pointer for 2846 multi-grid synchronization is 2847 passed in the kernarg. 2848 2849 "ValueType" string Unused and deprecated. This should no longer 2850 be emitted, but is accepted for compatibility. 2851 2852 2853 "PointeeAlign" integer Alignment in bytes of pointee 2854 type for pointer type kernel 2855 argument. Must be a power 2856 of 2. Only present if 2857 "ValueKind" is 2858 "DynamicSharedPointer". 2859 "AddrSpaceQual" string Kernel argument address space 2860 qualifier. Only present if 2861 "ValueKind" is "GlobalBuffer" or 2862 "DynamicSharedPointer". Values 2863 are: 2864 2865 - "Private" 2866 - "Global" 2867 - "Constant" 2868 - "Local" 2869 - "Generic" 2870 - "Region" 2871 2872 .. 
TODO:: 2873 2874 Is GlobalBuffer only Global 2875 or Constant? Is 2876 DynamicSharedPointer always 2877 Local? Can HCC allow Generic? 2878 How can Private or Region 2879 ever happen? 2880 2881 "AccQual" string Kernel argument access 2882 qualifier. Only present if 2883 "ValueKind" is "Image" or 2884 "Pipe". Values 2885 are: 2886 2887 - "ReadOnly" 2888 - "WriteOnly" 2889 - "ReadWrite" 2890 2891 .. TODO:: 2892 2893 Does this apply to 2894 GlobalBuffer? 2895 2896 "ActualAccQual" string The actual memory accesses 2897 performed by the kernel on the 2898 kernel argument. Only present if 2899 "ValueKind" is "GlobalBuffer", 2900 "Image", or "Pipe". This may be 2901 more restrictive than indicated 2902 by "AccQual" to reflect what the 2903 kernel actually does. If not 2904 present then the runtime must 2905 assume what is implied by 2906 "AccQual" and "IsConst". Values 2907 are: 2908 2909 - "ReadOnly" 2910 - "WriteOnly" 2911 - "ReadWrite" 2912 2913 "IsConst" boolean Indicates if the kernel argument 2914 is const qualified. Only present 2915 if "ValueKind" is 2916 "GlobalBuffer". 2917 2918 "IsRestrict" boolean Indicates if the kernel argument 2919 is restrict qualified. Only 2920 present if "ValueKind" is 2921 "GlobalBuffer". 2922 2923 "IsVolatile" boolean Indicates if the kernel argument 2924 is volatile qualified. Only 2925 present if "ValueKind" is 2926 "GlobalBuffer". 2927 2928 "IsPipe" boolean Indicates if the kernel argument 2929 is pipe qualified. Only present 2930 if "ValueKind" is "Pipe". 2931 2932 .. TODO:: 2933 2934 Can GlobalBuffer be pipe 2935 qualified? 2936 2937 ================= ============== ========= ================================ 2938 2939.. 2940 2941 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map 2942 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table 2943 2944 ============================ ============== ========= ===================== 2945 String Key Value Type Required?
Description 2946 ============================ ============== ========= ===================== 2947 "KernargSegmentSize" integer Required The size in bytes of 2948 the kernarg segment 2949 that holds the values 2950 of the arguments to 2951 the kernel. 2952 "GroupSegmentFixedSize" integer Required The amount of group 2953 segment memory 2954 required by a 2955 work-group in 2956 bytes. This does not 2957 include any 2958 dynamically allocated 2959 group segment memory 2960 that may be added 2961 when the kernel is 2962 dispatched. 2963 "PrivateSegmentFixedSize" integer Required The amount of fixed 2964 private address space 2965 memory required for a 2966 work-item in 2967 bytes. If the kernel 2968 uses a dynamic call 2969 stack then additional 2970 space must be added 2971 to this value for the 2972 call stack. 2973 "KernargSegmentAlign" integer Required The maximum byte 2974 alignment of 2975 arguments in the 2976 kernarg segment. Must 2977 be a power of 2. 2978 "WavefrontSize" integer Required Wavefront size. Must 2979 be a power of 2. 2980 "NumSGPRs" integer Required Number of scalar 2981 registers used by a 2982 wavefront for 2983 GFX6-GFX10. This 2984 includes the special 2985 SGPRs for VCC, Flat 2986 Scratch (GFX7-GFX10) 2987 and XNACK (for 2988 GFX8-GFX10). It does 2989 not include the 16 2990 SGPR added if a trap 2991 handler is 2992 enabled. It is not 2993 rounded up to the 2994 allocation 2995 granularity. 2996 "NumVGPRs" integer Required Number of vector 2997 registers used by 2998 each work-item for 2999 GFX6-GFX10 3000 "MaxFlatWorkGroupSize" integer Required Maximum flat 3001 work-group size 3002 supported by the 3003 kernel in work-items. 3004 Must be >=1 and 3005 consistent with 3006 ReqdWorkGroupSize if 3007 not 0, 0, 0. 3008 "NumSpilledSGPRs" integer Number of stores from 3009 a scalar register to 3010 a register allocator 3011 created spill 3012 location. 
3013 "NumSpilledVGPRs" integer Number of stores from 3014 a vector register to 3015 a register allocator 3016 created spill 3017 location. 3018 ============================ ============== ========= ===================== 3019 3020.. _amdgpu-amdhsa-code-object-metadata-v3: 3021 3022Code Object V3 Metadata 3023+++++++++++++++++++++++ 3024 3025.. warning:: 3026 Code object V3 is not the default code object version emitted by this version 3027 of LLVM. 3028 3029Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note 3030record (see :ref:`amdgpu-note-records-v3-onwards`). 3031 3032The metadata is represented as Message Pack formatted binary data (see 3033[MsgPack]_). The top level is a Message Pack map that includes the 3034keys defined in table 3035:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced 3036tables. 3037 3038Additional information can be added to the maps. To avoid conflicts, 3039any key names should be prefixed by "*vendor-name*." where 3040``vendor-name`` can be the name of the vendor and specific vendor 3041tool that generates the information. The prefix is abbreviated to 3042simply "." when it appears within a map that has been added by the 3043same *vendor-name*. 3044 3045 .. table:: AMDHSA Code Object V3 Metadata Map 3046 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 3047 3048 ================= ============== ========= ======================================= 3049 String Key Value Type Required? Description 3050 ================= ============== ========= ======================================= 3051 "amdhsa.version" sequence of Required - The first integer is the major 3052 2 integers version. Currently 1. 3053 - The second integer is the minor 3054 version. Currently 0. 3055 "amdhsa.printf" sequence of Each string is encoded information 3056 strings about a printf function call. 
The 3057 encoded information is organized as 3058 fields separated by colon (':'): 3059 3060 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 3061 3062 where: 3063 3064 ``ID`` 3065 A 32-bit integer as a unique id for 3066 each printf function call 3067 3068 ``N`` 3069 A 32-bit integer equal to the number 3070 of arguments of printf function call 3071 minus 1 3072 3073 ``S[i]`` (where i = 0, 1, ... , N-1) 3074 32-bit integers for the size in bytes 3075 of the i-th FormatString argument of 3076 the printf function call 3077 3078 FormatString 3079 The format string passed to the 3080 printf function call. 3081 "amdhsa.kernels" sequence of Required Sequence of the maps for each 3082 map kernel in the code object. See 3083 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` 3084 for the definition of the keys included 3085 in that map. 3086 ================= ============== ========= ======================================= 3087 3088.. 3089 3090 .. table:: AMDHSA Code Object V3 Kernel Metadata Map 3091 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 3092 3093 =================================== ============== ========= ================================ 3094 String Key Value Type Required? Description 3095 =================================== ============== ========= ================================ 3096 ".name" string Required Source name of the kernel. 3097 ".symbol" string Required Name of the kernel 3098 descriptor ELF symbol. 3099 ".language" string Source language of the kernel. 3100 Values include: 3101 3102 - "OpenCL C" 3103 - "OpenCL C++" 3104 - "HCC" 3105 - "HIP" 3106 - "OpenMP" 3107 - "Assembler" 3108 3109 ".language_version" sequence of - The first integer is the major 3110 2 integers version. 3111 - The second integer is the 3112 minor version. 3113 ".args" sequence of Sequence of maps of the 3114 map kernel arguments. 
See 3115 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` 3116 for the definition of the keys 3117 included in that map. 3118 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values 3119 3 integers must be >=1 and the dispatch 3120 work-group size X, Y, Z must 3121 correspond to the specified 3122 values. Defaults to 0, 0, 0. 3123 3124 Corresponds to the OpenCL 3125 ``reqd_work_group_size`` 3126 attribute. 3127 ".workgroup_size_hint" sequence of The dispatch work-group size 3128 3 integers X, Y, Z is likely to be the 3129 specified values. 3130 3131 Corresponds to the OpenCL 3132 ``work_group_size_hint`` 3133 attribute. 3134 ".vec_type_hint" string The name of a scalar or vector 3135 type. 3136 3137 Corresponds to the OpenCL 3138 ``vec_type_hint`` attribute. 3139 3140 ".device_enqueue_symbol" string The external symbol name 3141 associated with a kernel. 3142 OpenCL runtime allocates a 3143 global buffer for the symbol 3144 and saves the kernel's address 3145 to it, which is used for 3146 device side enqueueing. Only 3147 available for device side 3148 enqueued kernels. 3149 ".kernarg_segment_size" integer Required The size in bytes of 3150 the kernarg segment 3151 that holds the values 3152 of the arguments to 3153 the kernel. 3154 ".group_segment_fixed_size" integer Required The amount of group 3155 segment memory 3156 required by a 3157 work-group in 3158 bytes. This does not 3159 include any 3160 dynamically allocated 3161 group segment memory 3162 that may be added 3163 when the kernel is 3164 dispatched. 3165 ".private_segment_fixed_size" integer Required The amount of fixed 3166 private address space 3167 memory required for a 3168 work-item in 3169 bytes. If the kernel 3170 uses a dynamic call 3171 stack then additional 3172 space must be added 3173 to this value for the 3174 call stack. 3175 ".kernarg_segment_align" integer Required The maximum byte 3176 alignment of 3177 arguments in the 3178 kernarg segment. 
Must 3179 be a power of 2. 3180 ".wavefront_size" integer Required Wavefront size. Must 3181 be a power of 2. 3182 ".sgpr_count" integer Required Number of scalar 3183 registers required by a 3184 wavefront for 3185 GFX6-GFX9. A register 3186 is required if it is 3187 used explicitly, or 3188 if a higher numbered 3189 register is used 3190 explicitly. This 3191 includes the special 3192 SGPRs for VCC, Flat 3193 Scratch (GFX7-GFX9) 3194 and XNACK (for 3195 GFX8-GFX9). It does 3196 not include the 16 3197 SGPR added if a trap 3198 handler is 3199 enabled. It is not 3200 rounded up to the 3201 allocation 3202 granularity. 3203 ".vgpr_count" integer Required Number of vector 3204 registers required by 3205 each work-item for 3206 GFX6-GFX9. A register 3207 is required if it is 3208 used explicitly, or 3209 if a higher numbered 3210 register is used 3211 explicitly. 3212 ".agpr_count" integer Required Number of accumulator 3213 registers required by 3214 each work-item for 3215 GFX90A, GFX908. 3216 ".max_flat_workgroup_size" integer Required Maximum flat 3217 work-group size 3218 supported by the 3219 kernel in work-items. 3220 Must be >=1 and 3221 consistent with 3222 ReqdWorkGroupSize if 3223 not 0, 0, 0. 3224 ".sgpr_spill_count" integer Number of stores from 3225 a scalar register to 3226 a register allocator 3227 created spill 3228 location. 3229 ".vgpr_spill_count" integer Number of stores from 3230 a vector register to 3231 a register allocator 3232 created spill 3233 location. 3234 ".kind" string The kind of the kernel 3235 with the following 3236 values: 3237 3238 "normal" 3239 Regular kernels. 3240 3241 "init" 3242 These kernels must be 3243 invoked after loading 3244 the containing code 3245 object and must 3246 complete before any 3247 normal and fini 3248 kernels in the same 3249 code object are 3250 invoked. 
3251 3252 "fini" 3253 These kernels must be 3254 invoked before 3255 unloading the 3256 containing code object 3257 and after all init and 3258 normal kernels in the 3259 same code object have 3260 been invoked and 3261 completed. 3262 3263 If omitted, "normal" is 3264 assumed. 3265 =================================== ============== ========= ================================ 3266 3267.. 3268 3269 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map 3270 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 3271 3272 ====================== ============== ========= ================================ 3273 String Key Value Type Required? Description 3274 ====================== ============== ========= ================================ 3275 ".name" string Kernel argument name. 3276 ".type_name" string Kernel argument type name. 3277 ".size" integer Required Kernel argument size in bytes. 3278 ".offset" integer Required Kernel argument offset in 3279 bytes. The offset must be a 3280 multiple of the alignment 3281 required by the argument. 3282 ".value_kind" string Required Kernel argument kind that 3283 specifies how to set up the 3284 corresponding argument. 3285 Values include: 3286 3287 "by_value" 3288 The argument is copied 3289 directly into the kernarg. 3290 3291 "global_buffer" 3292 A global address space pointer 3293 to the buffer data is passed 3294 in the kernarg. 3295 3296 "dynamic_shared_pointer" 3297 A group address space pointer 3298 to dynamically allocated LDS 3299 is passed in the kernarg. 3300 3301 "sampler" 3302 A global address space 3303 pointer to a S# is passed in 3304 the kernarg. 3305 3306 "image" 3307 A global address space 3308 pointer to a T# is passed in 3309 the kernarg. 3310 3311 "pipe" 3312 A global address space pointer 3313 to an OpenCL pipe is passed in 3314 the kernarg. 3315 3316 "queue" 3317 A global address space pointer 3318 to an OpenCL device enqueue 3319 queue is passed in the 3320 kernarg. 
3321 3322 "hidden_global_offset_x" 3323 The OpenCL grid dispatch 3324 global offset for the X 3325 dimension is passed in the 3326 kernarg. 3327 3328 "hidden_global_offset_y" 3329 The OpenCL grid dispatch 3330 global offset for the Y 3331 dimension is passed in the 3332 kernarg. 3333 3334 "hidden_global_offset_z" 3335 The OpenCL grid dispatch 3336 global offset for the Z 3337 dimension is passed in the 3338 kernarg. 3339 3340 "hidden_none" 3341 An argument that is not used 3342 by the kernel. Space needs to 3343 be left for it, but it does 3344 not need to be set up. 3345 3346 "hidden_printf_buffer" 3347 A global address space pointer 3348 to the runtime printf buffer 3349 is passed in kernarg. 3350 3351 "hidden_hostcall_buffer" 3352 A global address space pointer 3353 to the runtime hostcall buffer 3354 is passed in kernarg. 3355 3356 "hidden_default_queue" 3357 A global address space pointer 3358 to the OpenCL device enqueue 3359 queue that should be used by 3360 the kernel by default is 3361 passed in the kernarg. 3362 3363 "hidden_completion_action" 3364 A global address space pointer 3365 to help link enqueued kernels into 3366 the ancestor tree for determining 3367 when the parent kernel has finished. 3368 3369 "hidden_multigrid_sync_arg" 3370 A global address space pointer for 3371 multi-grid synchronization is 3372 passed in the kernarg. 3373 3374 ".value_type" string Unused and deprecated. This should no longer 3375 be emitted, but is accepted for compatibility. 3376 3377 ".pointee_align" integer Alignment in bytes of pointee 3378 type for pointer type kernel 3379 argument. Must be a power 3380 of 2. Only present if 3381 ".value_kind" is 3382 "dynamic_shared_pointer". 3383 ".address_space" string Kernel argument address space 3384 qualifier. Only present if 3385 ".value_kind" is "global_buffer" or 3386 "dynamic_shared_pointer". 
                                                     Values are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::

                                                        Is "global_buffer" only "global"
                                                        or "constant"? Is
                                                        "dynamic_shared_pointer" always
                                                        "local"? Can HCC allow "generic"?
                                                        How can "private" or "region"
                                                        ever happen?

     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values
                                                     are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::

                                                        Does this apply to
                                                        "global_buffer"?

     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on the
                                                     kernel argument. Only present if
                                                     ".value_kind" is "global_buffer",
                                                     "image", or "pipe". This may be
                                                     more restrictive than indicated
                                                     by ".access" to reflect what the
                                                     kernel actually does. If not
                                                     present then the runtime must
                                                     assume what is implied by
                                                     ".access" and ".is_const". Values
                                                     are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel argument
                                                     is const qualified. Only present
                                                     if ".value_kind" is
                                                     "global_buffer".

     ".is_restrict"         boolean                  Indicates if the kernel argument
                                                     is restrict qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_volatile"         boolean                  Indicates if the kernel argument
                                                     is volatile qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_pipe"             boolean                  Indicates if the kernel argument
                                                     is pipe qualified. Only present
                                                     if ".value_kind" is "pipe".

                                                     .. TODO::

                                                        Can "global_buffer" be pipe
                                                        qualified?

     ====================== ============== ========= ================================

.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

.. _amdgpu-amdhsa-code-object-metadata-v5:

Code Object V5 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V5 is not the default code object version emitted by this version
  of LLVM.


Code object V5 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5` and table
:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.

  .. table:: AMDHSA Code Object V5 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 2.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values are the same as for
                                                     code object V3 metadata
                                                     (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
                                                     with the following additions:

                                                     "hidden_block_count_x"
                                                       The grid dispatch work-group count for the X dimension
                                                       is passed in the kernarg. Some languages, such as OpenCL,
                                                       support a last work-group in each dimension being partial.
                                                       This count only includes the non-partial work-group count.
                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items.

                                                     "hidden_block_count_y"
                                                       The grid dispatch work-group count for the Y dimension
                                                       is passed in the kernarg. Some languages, such as OpenCL,
                                                       support a last work-group in each dimension being partial.
                                                       This count only includes the non-partial work-group count.
                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1, then it must be 1.

                                                     "hidden_block_count_z"
                                                       The grid dispatch work-group count for the Z dimension
                                                       is passed in the kernarg. Some languages, such as OpenCL,
                                                       support a last work-group in each dimension being partial.
                                                       This count only includes the non-partial work-group count.
                                                       This is not the same as the value in the AQL dispatch packet,
                                                       which has the grid size in work-items. If the grid dimensionality
                                                       is 1 or 2, then it must be 1.

                                                     "hidden_group_size_x"
                                                       The grid dispatch work-group size for the X dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size.

                                                     "hidden_group_size_y"
                                                       The grid dispatch work-group size for the Y dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1, then it must be 1.

                                                     "hidden_group_size_z"
                                                       The grid dispatch work-group size for the Z dimension is
                                                       passed in the kernarg. This size only applies to the
                                                       non-partial work-groups. This is the same value as the AQL
                                                       dispatch packet work-group size. If the grid dimensionality
                                                       is 1 or 2, then it must be 1.

                                                     "hidden_remainder_x"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the X dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the X dimension.

                                                     "hidden_remainder_y"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the Y dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the Y dimension.

                                                     "hidden_remainder_z"
                                                       The grid dispatch work-group size of the partial work-group
                                                       of the Z dimension, if it exists. Must be zero if a partial
                                                       work-group does not exist in the Z dimension.

                                                     "hidden_grid_dims"
                                                       The grid dispatch dimensionality. This is the same value
                                                       as the AQL dispatch packet dimensionality. Must be a value
                                                       between 1 and 3.

                                                     "hidden_heap_v1"
                                                       A global address space pointer to an initialized memory
                                                       buffer that conforms to the requirements of the malloc/free
                                                       device library V1 version implementation.

                                                     "hidden_private_base"
                                                       The high 32 bits of the flat addressing private aperture base.
                                                       Only used by GFX8 to allow conversion between private segment
                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

                                                     "hidden_shared_base"
                                                       The high 32 bits of the flat addressing shared aperture base.
                                                       Only used by GFX8 to allow conversion between shared segment
                                                       and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

                                                     "hidden_queue_ptr"
                                                       A global memory address space pointer to the ROCm runtime
                                                       ``struct amd_queue_t`` structure for the HSA queue of the
                                                       associated dispatch AQL packet. It is only required for pre-GFX9
                                                       devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).

     ====================== ============== ========= ================================

..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it.
For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was loaded
   by an HSA compatible runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent.
   AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving.
The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.

The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
For
GFX9-GFX10 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP such as managing the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need to
                                                     be added to this value if
                                                     the call stack has
                                                     non-inlined function calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer in
                                                       the dispatch packet is NULL
                                                       then there are no kernel
                                                       arguments.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is 0 then the kernarg
                                                       memory size is
                                                       unspecified.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is not 0 then the value
                                                       specifies the kernarg
                                                       memory size in bytes. It
                                                       is recommended to provide
                                                       a value as it may be used
                                                       by CP to optimize making
                                                       the kernarg memory
                                                       visible to the kernel
                                                       code.

     127:96  4 bytes                                 Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20                                      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX90A, GFX940
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
                     _BUFFER                         column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
                                                     column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _SIZE
     457:455 3 bits                                  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits                                  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
     511:472 5 bytes                                 Reserved, must be 0.
     512                                             **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each work-item;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX90A, GFX940
                                                       - vgprs_used 0..512
                                                       - vgprs_used = align(arch_vgprs, 4)
                                                         + acc_vgprs
                                                       - max(0, ceil(vgprs_used / 8) - 1)
                                                     GFX10 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     ``.amdhsa_kernel`` directive
                                                     by the
                                                     ``.amdhsa_next_free_vgpr``
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     ``.amdhsa_kernel`` directive
                                                     by the
                                                     ``.amdhsa_next_free_sgpr``
                                                     and ``.amdhsa_reserve_*``
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with the specified rounding
                                                     mode for single precision
                                                     (32-bit) floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with the specified rounding
                                                     mode for half and double
                                                     precision (16-bit and
                                                     64-bit) floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with the specified denorm
                                                     mode for single precision
                                                     (32-bit) floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with the specified denorm
                                                     mode for half and double
                                                     precision (16-bit and
                                                     64-bit) floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute work-groups in
                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups in
                                                         WGP wavefront execution mode.

                                                       See :ref:`amdgpu-amdhsa-memory-model`.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior of the
                                                       s_waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute SIMD wavefronts
                                                         using oldest first policy.
                                                       - If 1 execute SIMD wavefronts to
                                                         ensure wavefronts will make some
                                                         forward progress.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32                                              **Total size 4 bytes**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
                                                       private segment.
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       does not specify
                                                       *Architected flat
                                                       scratch* then enable the
                                                       setup of the SGPR
                                                       wavefront scratch offset
                                                       system register (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       specifies *Architected
                                                       flat scratch* then enable
                                                       the setup of the
                                                       FLAT_SCRATCH register
                                                       pair (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data
                                                     registers requested. This
                                                     number must be greater than
                                                     or equal to the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed a
                                                     trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the X
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Y
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions
                                                     enabled which are generated
                                                     when a memory violation has
                                                     occurred for this wavefront from
                                                     L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX10
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32                                      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10 bits                                 Reserved, must be 0.
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15 bits                                 Reserved, must be 0.
     32                                      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
                                                     wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
                                                     of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
                                                     not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
     31:4    28 bits                                 Reserved, must be 0.
     32                                      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
                _segment_buffer)
     then       Dispatch Ptr               2      64-bit address of AQL dispatch
                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
                                                  actually executing.
     then       Queue Ptr                  2      64-bit address of amd_queue_t
                (enable_sgpr_queue_ptr)           object for AQL queue on which
                                                  the dispatch packet was
                                                  queued.
     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
                (enable_sgpr_kernarg              segment. This is directly
                _segment_ptr)                     copied from the
                                                  kernarg_address in the kernel
                                                  dispatch packet.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.
     then       Dispatch Id                2      64-bit Dispatch ID of the
                (enable_sgpr_dispatch_id)         dispatch packet being
                                                  executed.
     then       Flat Scratch Init          2      See
                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
                _init)
     then       Private Segment Size       1      The 32-bit byte size of a
                (enable_sgpr_private              single work-item's memory
                _segment_size)                    allocation. This is the
                                                  value from the kernel
                                                  dispatch packet Private
                                                  Segment Byte Size rounded up
                                                  by CP to a multiple of
                                                  DWORD.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.

                                                  This is not used for
                                                  GFX7-GFX8 since it is the same
                                                  value as the second SGPR of
                                                  Flat Scratch Init. However, it
                                                  may be needed for GFX9-GFX10 which
                                                  changes the meaning of the
                                                  Flat Scratch Init value.
     then       Work-Group Id X            1      32-bit work-group id in X
                (enable_sgpr_workgroup_id         dimension of grid for
                _X)                               wavefront.
     then       Work-Group Id Y            1      32-bit work-group id in Y
                (enable_sgpr_workgroup_id         dimension of grid for
                _Y)                               wavefront.
     then       Work-Group Id Z            1      32-bit work-group id in Z
                (enable_sgpr_workgroup_id         dimension of grid for
                _Z)                               wavefront.
     then       Work-Group Info            1      {first_wavefront, 14'b0000,
                (enable_sgpr_workgroup            ordered_append_term[10:0],
                _info)                            threadgroup_size_in_wavefronts[5:0]}
     then       Scratch Wavefront Offset   1      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
                _segment_wavefront_offset)        and
                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
     ========== ========================== ====== ==============================

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.

There are different methods used for the VGPR initial state:

* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies otherwise, a separate VGPR register is used per work-item ID. The
  VGPR register initial state for this method is defined in
  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
  for all work-item IDs. The register layout for this method is defined in
  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.

  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table

     ========== ========================== ====== ==============================
     VGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     VGPRs
     ========== ========================== ====== ==============================
     First      Work-Item Id X             1      32-bit work-item id in X
                (Always initialized)              dimension of work-group for
                                                  wavefront lane.
     then       Work-Item Id Y             1      32-bit work-item id in Y
                (enable_vgpr_workitem_id          dimension of work-group for
                > 0)                              wavefront lane.
     then       Work-Item Id Z             1      32-bit work-item id in Z
                (enable_vgpr_workitem_id          dimension of work-group for
                > 1)                              wavefront lane.
     ========== ========================== ====== ==============================

..

  .. table:: Register Layout for Packed Work-Item ID Method
     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table

     ======= ======= ================ =========================================
     Bits    Size    Field Name       Description
     ======= ======= ================ =========================================
     9:0     10 bits Work-Item Id X   Work-item id in X
                                      dimension of work-group for
                                      wavefront lane.

                                      Always initialized.

     19:10   10 bits Work-Item Id Y   Work-item id in Y
                                      dimension of work-group for
                                      wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      0, otherwise set to 0.
     29:20   10 bits Work-Item Id Z   Work-item id in Z
                                      dimension of work-group for
                                      wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      1, otherwise set to 0.
     31:30   2 bits                   Reserved, set to 0.
     ======= ======= ================ =========================================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
   its value cannot be included with the flat scratch init value which is per
   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
5. Flat Scratch register pair initialization is described in
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

The global segment can be accessed either using buffer instructions (GFX6 which
has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
instructions (GFX9-GFX10).

If buffer operations are used, then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC for
  APU and NC for dGPU).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

The compiler performs initialization in the kernel prologue depending on the
target and information about things like stack usage in the kernel and called
functions. Some of this initialization requires the compiler to request certain
User and System SGPRs be present in the
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
:ref:`amdgpu-amdhsa-kernel-descriptor`.

.. _amdgpu-amdhsa-kernel-prolog-cfi:

CFI
+++

1.  The CFI return address is undefined.

2.  The CFI CFA is defined using an expression which evaluates to a location
    description that comprises one memory location description for the
    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.

.. _amdgpu-amdhsa-kernel-prolog-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. Total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible LDS size for the given target (0x7FFF for GFX6 and 0xFFFF
  for GFX7-GFX8).
GFX9-GFX10
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.

.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:

Stack Pointer
+++++++++++++

If the kernel has function calls it must set up the ABI stack pointer described
in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.

.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:

Frame Pointer
+++++++++++++

If the kernel needs a frame pointer for the reasons defined in
``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
kernel prolog. If a frame pointer is not required then all uses of the frame
pointer are replaced with immediate ``0`` offsets.

.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:

Flat Scratch
++++++++++++

There are different methods used for initializing flat scratch:

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Does not support generic address space*:

  Flat scratch is not supported and there is no flat scratch register pair.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Offset flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
  Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address.

     CP obtains this from the runtime. (The Scratch Segment Buffer base address
     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)

     The prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.

     The Scratch Wavefront Offset must also be used as an offset with Private
     segment address when using the Scratch Segment Buffer.

     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.

     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
     SGPRn is the highest numbered SGPR allocated to the wavefront).
     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
     FLAT SCRATCH BASE in flat memory instructions that access the scratch
     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.

     CP obtains this from the runtime, and it is always a multiple of DWORD.
     CP checks that the value in the kernel dispatch packet Private Segment Byte
     Size is not larger and requests the runtime to increase the queue's scratch
     size if necessary.

     CP directly loads from the kernel dispatch packet Private Segment Byte Size
     field and rounds up to a multiple of DWORD. Having CP load it once avoids
     loading it at the beginning of every wavefront.

     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
     in flat memory instructions.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Absolute flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.

  CP obtains this from the runtime.

  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
  memory instructions.

  The Scratch Wavefront Offset must also be used as an offset with Private
  segment address when using the Scratch Segment Buffer (see
  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
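
  As a rough illustration of the arithmetic described for the *Offset flat
  scratch* and *Absolute flat scratch* methods, the following Python sketch
  models the SGPR values as plain integers. The function and parameter names
  are hypothetical and are not part of LLVM or of any runtime; the sketch only
  mirrors the add, shift and move steps stated in the text and says nothing
  about which physical SGPRs hold the results.

  .. code-block:: python

    MASK32 = 0xFFFFFFFF
    MASK64 = 0xFFFFFFFFFFFFFFFF


    def offset_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
        """Offset flat scratch: return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI).

        init_lo: low word of Flat Scratch Init (byte offset of the queue's
                 scratch backing memory).
        init_hi: second word of Flat Scratch Init (per work-item byte size).
        """
        # Add the per-wavefront offset to get the wavefront's byte offset.
        wave_byte_offset = (init_lo + scratch_wavefront_offset) & MASK32
        # The base register is in units of 256 bytes, so shift right by 8.
        flat_scratch_hi = wave_byte_offset >> 8
        # The size word is moved unchanged and acts as FLAT SCRATCH SIZE.
        flat_scratch_lo = init_hi
        return flat_scratch_lo, flat_scratch_hi


    def absolute_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
        """Absolute flat scratch: return (low word, high word) of the base."""
        # Flat Scratch Init is a 64-bit address held in two 32-bit SGPRs.
        base = (init_hi << 32) | init_lo
        # 64-bit add of the wavefront offset, carrying into the high word.
        address = (base + scratch_wavefront_offset) & MASK64
        return address & MASK32, address >> 32

  For example, under these assumptions
  ``offset_flat_scratch(0x10000, 64, 0x300)`` yields a base of ``0x103`` (in
  256-byte units) and a size of ``64``.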

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Architected flat scratch*:

  If ENABLE_PRIVATE_SEGMENT is enabled in
  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
  register pair will be initialized to the 64-bit address of the base of scratch
  backing memory being managed by SPI for the queue executing the kernel
  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
  flat scratch base in flat memory instructions.

.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:

Private Segment Buffer
++++++++++++++++++++++

If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
*Architected flat scratch* then a Private Segment Buffer is not supported.
Instead the flat SCRATCH instructions are used.

Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by
the runtime. It is used, together with Scratch Wavefront Offset as an offset,
to access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
always selected for the kernel as follows:

  - If it is known during instruction selection that there is stack usage,
    SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
    optimizations are disabled (``-O0``), if stack objects already exist (for
    locals, etc.), or if there are any function calls.

  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
    are reserved for the tentative scratch V#. These will be used if it is
    determined that spilling is needed.

    - If no use is made of the tentative scratch V#, then it is unreserved,
      and the register count is determined ignoring it.
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.

    .. note::

      This approach of using a tentative scratch V# and shifting the register
      numbers if used avoids having to perform register allocation a second
      time if the tentative V# is eliminated. This is more efficient and
      avoids the problem that the second register allocation may perform
      spilling which will fail as there is no longer a scratch V#.

When the kernel prolog code is being emitted it is known whether the scratch V#
described above is actually used. If it is, the prolog code must set it up by
copying the Private Segment Buffer to the scratch V# registers and then adding
the Private Segment Wavefront Offset to the queue base address in the V#. The
result is a V# with a base address pointing to the beginning of the wavefront
scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute.
The ``s_waitcnt`` and cache
management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve performance.
Only the instructions related to the memory model are given; additional
``s_waitcnt`` instructions are required to ensure registers are defined before
being used. These may be able to be combined with the memory model ``s_waitcnt``
instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch it is not legal to perform
stores, and atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

The code sequences used to implement the memory model are defined in the
following sections:

* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx940`
* :ref:`amdgpu-amdhsa-memory-model-gfx10`

.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:

Memory Model GFX6-GFX9
++++++++++++++++++++++

For GFX6-GFX9:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group.
  A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions.
See :ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX6-GFX9 are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX6-GFX9
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
5185 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 5186 **Monotonic Atomic** 5187 ------------------------------------------------------------------------------------ 5188 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load 5189 - wavefront - local 5190 - workgroup - generic 5191 load atomic monotonic - agent - global 1. buffer/global/flat_load 5192 - system - generic glc=1 5193 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 5194 - wavefront - generic 5195 - workgroup 5196 - agent 5197 - system 5198 store atomic monotonic - singlethread - local 1. ds_store 5199 - wavefront 5200 - workgroup 5201 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 5202 - wavefront - generic 5203 - workgroup 5204 - agent 5205 - system 5206 atomicrmw monotonic - singlethread - local 1. ds_atomic 5207 - wavefront 5208 - workgroup 5209 **Acquire Atomic** 5210 ------------------------------------------------------------------------------------ 5211 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 5212 - wavefront - local 5213 - generic 5214 load atomic acquire - workgroup - global 1. buffer/global_load 5215 load atomic acquire - workgroup - local 1. ds/flat_load 5216 - generic 2. s_waitcnt lgkmcnt(0) 5217 5218 - If OpenCL, omit. 5219 - Must happen before 5220 any following 5221 global/generic 5222 load/load 5223 atomic/store/store 5224 atomic/atomicrmw. 5225 - Ensures any 5226 following global 5227 data read is no 5228 older than a local load 5229 atomic value being 5230 acquired. 5231 5232 load atomic acquire - agent - global 1. buffer/global_load 5233 - system glc=1 5234 2. s_waitcnt vmcnt(0) 5235 5236 - Must happen before 5237 following 5238 buffer_wbinvl1_vol. 5239 - Ensures the load 5240 has completed 5241 before invalidating 5242 the cache. 5243 5244 3. 
buffer_wbinvl1_vol 5245 5246 - Must happen before 5247 any following 5248 global/generic 5249 load/load 5250 atomic/atomicrmw. 5251 - Ensures that 5252 following 5253 loads will not see 5254 stale global data. 5255 5256 load atomic acquire - agent - generic 1. flat_load glc=1 5257 - system 2. s_waitcnt vmcnt(0) & 5258 lgkmcnt(0) 5259 5260 - If OpenCL omit 5261 lgkmcnt(0). 5262 - Must happen before 5263 following 5264 buffer_wbinvl1_vol. 5265 - Ensures the flat_load 5266 has completed 5267 before invalidating 5268 the cache. 5269 5270 3. buffer_wbinvl1_vol 5271 5272 - Must happen before 5273 any following 5274 global/generic 5275 load/load 5276 atomic/atomicrmw. 5277 - Ensures that 5278 following loads 5279 will not see stale 5280 global data. 5281 5282 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 5283 - wavefront - local 5284 - generic 5285 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 5286 atomicrmw acquire - workgroup - local 1. ds/flat_atomic 5287 - generic 2. s_waitcnt lgkmcnt(0) 5288 5289 - If OpenCL, omit. 5290 - Must happen before 5291 any following 5292 global/generic 5293 load/load 5294 atomic/store/store 5295 atomic/atomicrmw. 5296 - Ensures any 5297 following global 5298 data read is no 5299 older than a local 5300 atomicrmw value 5301 being acquired. 5302 5303 atomicrmw acquire - agent - global 1. buffer/global_atomic 5304 - system 2. s_waitcnt vmcnt(0) 5305 5306 - Must happen before 5307 following 5308 buffer_wbinvl1_vol. 5309 - Ensures the 5310 atomicrmw has 5311 completed before 5312 invalidating the 5313 cache. 5314 5315 3. buffer_wbinvl1_vol 5316 5317 - Must happen before 5318 any following 5319 global/generic 5320 load/load 5321 atomic/atomicrmw. 5322 - Ensures that 5323 following loads 5324 will not see stale 5325 global data. 5326 5327 atomicrmw acquire - agent - generic 1. flat_atomic 5328 - system 2. s_waitcnt vmcnt(0) & 5329 lgkmcnt(0) 5330 5331 - If OpenCL, omit 5332 lgkmcnt(0). 
5333 - Must happen before 5334 following 5335 buffer_wbinvl1_vol. 5336 - Ensures the 5337 atomicrmw has 5338 completed before 5339 invalidating the 5340 cache. 5341 5342 3. buffer_wbinvl1_vol 5343 5344 - Must happen before 5345 any following 5346 global/generic 5347 load/load 5348 atomic/atomicrmw. 5349 - Ensures that 5350 following loads 5351 will not see stale 5352 global data. 5353 5354 fence acquire - singlethread *none* *none* 5355 - wavefront 5356 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5357 5358 - If OpenCL and 5359 address space is 5360 not generic, omit. 5361 - However, since LLVM 5362 currently has no 5363 address space on 5364 the fence need to 5365 conservatively 5366 always generate. If 5367 fence had an 5368 address space then 5369 set to address 5370 space of OpenCL 5371 fence flag, or to 5372 generic if both 5373 local and global 5374 flags are 5375 specified. 5376 - Must happen after 5377 any preceding 5378 local/generic load 5379 atomic/atomicrmw 5380 with an equal or 5381 wider sync scope 5382 and memory ordering 5383 stronger than 5384 unordered (this is 5385 termed the 5386 fence-paired-atomic). 5387 - Must happen before 5388 any following 5389 global/generic 5390 load/load 5391 atomic/store/store 5392 atomic/atomicrmw. 5393 - Ensures any 5394 following global 5395 data read is no 5396 older than the 5397 value read by the 5398 fence-paired-atomic. 5399 5400 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 5401 - system vmcnt(0) 5402 5403 - If OpenCL and 5404 address space is 5405 not generic, omit 5406 lgkmcnt(0). 5407 - However, since LLVM 5408 currently has no 5409 address space on 5410 the fence need to 5411 conservatively 5412 always generate 5413 (see comment for 5414 previous fence). 5415 - Could be split into 5416 separate s_waitcnt 5417 vmcnt(0) and 5418 s_waitcnt 5419 lgkmcnt(0) to allow 5420 them to be 5421 independently moved 5422 according to the 5423 following rules. 
5424 - s_waitcnt vmcnt(0) 5425 must happen after 5426 any preceding 5427 global/generic load 5428 atomic/atomicrmw 5429 with an equal or 5430 wider sync scope 5431 and memory ordering 5432 stronger than 5433 unordered (this is 5434 termed the 5435 fence-paired-atomic). 5436 - s_waitcnt lgkmcnt(0) 5437 must happen after 5438 any preceding 5439 local/generic load 5440 atomic/atomicrmw 5441 with an equal or 5442 wider sync scope 5443 and memory ordering 5444 stronger than 5445 unordered (this is 5446 termed the 5447 fence-paired-atomic). 5448 - Must happen before 5449 the following 5450 buffer_wbinvl1_vol. 5451 - Ensures that the 5452 fence-paired atomic 5453 has completed 5454 before invalidating 5455 the 5456 cache. Therefore 5457 any following 5458 locations read must 5459 be no older than 5460 the value read by 5461 the 5462 fence-paired-atomic. 5463 5464 2. buffer_wbinvl1_vol 5465 5466 - Must happen before any 5467 following global/generic 5468 load/load 5469 atomic/store/store 5470 atomic/atomicrmw. 5471 - Ensures that 5472 following loads 5473 will not see stale 5474 global data. 5475 5476 **Release Atomic** 5477 ------------------------------------------------------------------------------------ 5478 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 5479 - wavefront - local 5480 - generic 5481 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5482 - generic 5483 - If OpenCL, omit. 5484 - Must happen after 5485 any preceding 5486 local/generic 5487 load/store/load 5488 atomic/store 5489 atomic/atomicrmw. 5490 - Must happen before 5491 the following 5492 store. 5493 - Ensures that all 5494 memory operations 5495 to local have 5496 completed before 5497 performing the 5498 store that is being 5499 released. 5500 5501 2. buffer/global/flat_store 5502 store atomic release - workgroup - local 1. ds_store 5503 store atomic release - agent - global 1. 
s_waitcnt lgkmcnt(0) & 5504 - system - generic vmcnt(0) 5505 5506 - If OpenCL and 5507 address space is 5508 not generic, omit 5509 lgkmcnt(0). 5510 - Could be split into 5511 separate s_waitcnt 5512 vmcnt(0) and 5513 s_waitcnt 5514 lgkmcnt(0) to allow 5515 them to be 5516 independently moved 5517 according to the 5518 following rules. 5519 - s_waitcnt vmcnt(0) 5520 must happen after 5521 any preceding 5522 global/generic 5523 load/store/load 5524 atomic/store 5525 atomic/atomicrmw. 5526 - s_waitcnt lgkmcnt(0) 5527 must happen after 5528 any preceding 5529 local/generic 5530 load/store/load 5531 atomic/store 5532 atomic/atomicrmw. 5533 - Must happen before 5534 the following 5535 store. 5536 - Ensures that all 5537 memory operations 5538 to memory have 5539 completed before 5540 performing the 5541 store that is being 5542 released. 5543 5544 2. buffer/global/flat_store 5545 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 5546 - wavefront - local 5547 - generic 5548 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5549 - generic 5550 - If OpenCL, omit. 5551 - Must happen after 5552 any preceding 5553 local/generic 5554 load/store/load 5555 atomic/store 5556 atomic/atomicrmw. 5557 - Must happen before 5558 the following 5559 atomicrmw. 5560 - Ensures that all 5561 memory operations 5562 to local have 5563 completed before 5564 performing the 5565 atomicrmw that is 5566 being released. 5567 5568 2. buffer/global/flat_atomic 5569 atomicrmw release - workgroup - local 1. ds_atomic 5570 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 5571 - system - generic vmcnt(0) 5572 5573 - If OpenCL, omit 5574 lgkmcnt(0). 5575 - Could be split into 5576 separate s_waitcnt 5577 vmcnt(0) and 5578 s_waitcnt 5579 lgkmcnt(0) to allow 5580 them to be 5581 independently moved 5582 according to the 5583 following rules. 
5584 - s_waitcnt vmcnt(0) 5585 must happen after 5586 any preceding 5587 global/generic 5588 load/store/load 5589 atomic/store 5590 atomic/atomicrmw. 5591 - s_waitcnt lgkmcnt(0) 5592 must happen after 5593 any preceding 5594 local/generic 5595 load/store/load 5596 atomic/store 5597 atomic/atomicrmw. 5598 - Must happen before 5599 the following 5600 atomicrmw. 5601 - Ensures that all 5602 memory operations 5603 to global and local 5604 have completed 5605 before performing 5606 the atomicrmw that 5607 is being released. 5608 5609 2. buffer/global/flat_atomic 5610 fence release - singlethread *none* *none* 5611 - wavefront 5612 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5613 5614 - If OpenCL and 5615 address space is 5616 not generic, omit. 5617 - However, since LLVM 5618 currently has no 5619 address space on 5620 the fence need to 5621 conservatively 5622 always generate. If 5623 fence had an 5624 address space then 5625 set to address 5626 space of OpenCL 5627 fence flag, or to 5628 generic if both 5629 local and global 5630 flags are 5631 specified. 5632 - Must happen after 5633 any preceding 5634 local/generic 5635 load/load 5636 atomic/store/store 5637 atomic/atomicrmw. 5638 - Must happen before 5639 any following store 5640 atomic/atomicrmw 5641 with an equal or 5642 wider sync scope 5643 and memory ordering 5644 stronger than 5645 unordered (this is 5646 termed the 5647 fence-paired-atomic). 5648 - Ensures that all 5649 memory operations 5650 to local have 5651 completed before 5652 performing the 5653 following 5654 fence-paired-atomic. 5655 5656 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 5657 - system vmcnt(0) 5658 5659 - If OpenCL and 5660 address space is 5661 not generic, omit 5662 lgkmcnt(0). 5663 - If OpenCL and 5664 address space is 5665 local, omit 5666 vmcnt(0). 5667 - However, since LLVM 5668 currently has no 5669 address space on 5670 the fence need to 5671 conservatively 5672 always generate. 
If 5673 fence had an 5674 address space then 5675 set to address 5676 space of OpenCL 5677 fence flag, or to 5678 generic if both 5679 local and global 5680 flags are 5681 specified. 5682 - Could be split into 5683 separate s_waitcnt 5684 vmcnt(0) and 5685 s_waitcnt 5686 lgkmcnt(0) to allow 5687 them to be 5688 independently moved 5689 according to the 5690 following rules. 5691 - s_waitcnt vmcnt(0) 5692 must happen after 5693 any preceding 5694 global/generic 5695 load/store/load 5696 atomic/store 5697 atomic/atomicrmw. 5698 - s_waitcnt lgkmcnt(0) 5699 must happen after 5700 any preceding 5701 local/generic 5702 load/store/load 5703 atomic/store 5704 atomic/atomicrmw. 5705 - Must happen before 5706 any following store 5707 atomic/atomicrmw 5708 with an equal or 5709 wider sync scope 5710 and memory ordering 5711 stronger than 5712 unordered (this is 5713 termed the 5714 fence-paired-atomic). 5715 - Ensures that all 5716 memory operations 5717 have 5718 completed before 5719 performing the 5720 following 5721 fence-paired-atomic. 5722 5723 **Acquire-Release Atomic** 5724 ------------------------------------------------------------------------------------ 5725 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 5726 - wavefront - local 5727 - generic 5728 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 5729 5730 - If OpenCL, omit. 5731 - Must happen after 5732 any preceding 5733 local/generic 5734 load/store/load 5735 atomic/store 5736 atomic/atomicrmw. 5737 - Must happen before 5738 the following 5739 atomicrmw. 5740 - Ensures that all 5741 memory operations 5742 to local have 5743 completed before 5744 performing the 5745 atomicrmw that is 5746 being released. 5747 5748 2. buffer/global_atomic 5749 5750 atomicrmw acq_rel - workgroup - local 1. ds_atomic 5751 2. s_waitcnt lgkmcnt(0) 5752 5753 - If OpenCL, omit. 
5754 - Must happen before 5755 any following 5756 global/generic 5757 load/load 5758 atomic/store/store 5759 atomic/atomicrmw. 5760 - Ensures any 5761 following global 5762 data read is no 5763 older than the local load 5764 atomic value being 5765 acquired. 5766 5767 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 5768 5769 - If OpenCL, omit. 5770 - Must happen after 5771 any preceding 5772 local/generic 5773 load/store/load 5774 atomic/store 5775 atomic/atomicrmw. 5776 - Must happen before 5777 the following 5778 atomicrmw. 5779 - Ensures that all 5780 memory operations 5781 to local have 5782 completed before 5783 performing the 5784 atomicrmw that is 5785 being released. 5786 5787 2. flat_atomic 5788 3. s_waitcnt lgkmcnt(0) 5789 5790 - If OpenCL, omit. 5791 - Must happen before 5792 any following 5793 global/generic 5794 load/load 5795 atomic/store/store 5796 atomic/atomicrmw. 5797 - Ensures any 5798 following global 5799 data read is no 5800 older than a local load 5801 atomic value being 5802 acquired. 5803 5804 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 5805 - system vmcnt(0) 5806 5807 - If OpenCL, omit 5808 lgkmcnt(0). 5809 - Could be split into 5810 separate s_waitcnt 5811 vmcnt(0) and 5812 s_waitcnt 5813 lgkmcnt(0) to allow 5814 them to be 5815 independently moved 5816 according to the 5817 following rules. 5818 - s_waitcnt vmcnt(0) 5819 must happen after 5820 any preceding 5821 global/generic 5822 load/store/load 5823 atomic/store 5824 atomic/atomicrmw. 5825 - s_waitcnt lgkmcnt(0) 5826 must happen after 5827 any preceding 5828 local/generic 5829 load/store/load 5830 atomic/store 5831 atomic/atomicrmw. 5832 - Must happen before 5833 the following 5834 atomicrmw. 5835 - Ensures that all 5836 memory operations 5837 to global have 5838 completed before 5839 performing the 5840 atomicrmw that is 5841 being released. 5842 5843 2. buffer/global_atomic 5844 3. 
s_waitcnt vmcnt(0) 5845 5846 - Must happen before 5847 following 5848 buffer_wbinvl1_vol. 5849 - Ensures the 5850 atomicrmw has 5851 completed before 5852 invalidating the 5853 cache. 5854 5855 4. buffer_wbinvl1_vol 5856 5857 - Must happen before 5858 any following 5859 global/generic 5860 load/load 5861 atomic/atomicrmw. 5862 - Ensures that 5863 following loads 5864 will not see stale 5865 global data. 5866 5867 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 5868 - system vmcnt(0) 5869 5870 - If OpenCL, omit 5871 lgkmcnt(0). 5872 - Could be split into 5873 separate s_waitcnt 5874 vmcnt(0) and 5875 s_waitcnt 5876 lgkmcnt(0) to allow 5877 them to be 5878 independently moved 5879 according to the 5880 following rules. 5881 - s_waitcnt vmcnt(0) 5882 must happen after 5883 any preceding 5884 global/generic 5885 load/store/load 5886 atomic/store 5887 atomic/atomicrmw. 5888 - s_waitcnt lgkmcnt(0) 5889 must happen after 5890 any preceding 5891 local/generic 5892 load/store/load 5893 atomic/store 5894 atomic/atomicrmw. 5895 - Must happen before 5896 the following 5897 atomicrmw. 5898 - Ensures that all 5899 memory operations 5900 to global have 5901 completed before 5902 performing the 5903 atomicrmw that is 5904 being released. 5905 5906 2. flat_atomic 5907 3. s_waitcnt vmcnt(0) & 5908 lgkmcnt(0) 5909 5910 - If OpenCL, omit 5911 lgkmcnt(0). 5912 - Must happen before 5913 following 5914 buffer_wbinvl1_vol. 5915 - Ensures the 5916 atomicrmw has 5917 completed before 5918 invalidating the 5919 cache. 5920 5921 4. buffer_wbinvl1_vol 5922 5923 - Must happen before 5924 any following 5925 global/generic 5926 load/load 5927 atomic/atomicrmw. 5928 - Ensures that 5929 following loads 5930 will not see stale 5931 global data. 5932 5933 fence acq_rel - singlethread *none* *none* 5934 - wavefront 5935 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5936 5937 - If OpenCL and 5938 address space is 5939 not generic, omit. 
5940 - However, 5941 since LLVM 5942 currently has no 5943 address space on 5944 the fence need to 5945 conservatively 5946 always generate 5947 (see comment for 5948 previous fence). 5949 - Must happen after 5950 any preceding 5951 local/generic 5952 load/load 5953 atomic/store/store 5954 atomic/atomicrmw. 5955 - Must happen before 5956 any following 5957 global/generic 5958 load/load 5959 atomic/store/store 5960 atomic/atomicrmw. 5961 - Ensures that all 5962 memory operations 5963 to local have 5964 completed before 5965 performing any 5966 following global 5967 memory operations. 5968 - Ensures that the 5969 preceding 5970 local/generic load 5971 atomic/atomicrmw 5972 with an equal or 5973 wider sync scope 5974 and memory ordering 5975 stronger than 5976 unordered (this is 5977 termed the 5978 acquire-fence-paired-atomic) 5979 has completed 5980 before following 5981 global memory 5982 operations. This 5983 satisfies the 5984 requirements of 5985 acquire. 5986 - Ensures that all 5987 previous memory 5988 operations have 5989 completed before a 5990 following 5991 local/generic store 5992 atomic/atomicrmw 5993 with an equal or 5994 wider sync scope 5995 and memory ordering 5996 stronger than 5997 unordered (this is 5998 termed the 5999 release-fence-paired-atomic). 6000 This satisfies the 6001 requirements of 6002 release. 6003 6004 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 6005 - system vmcnt(0) 6006 6007 - If OpenCL and 6008 address space is 6009 not generic, omit 6010 lgkmcnt(0). 6011 - However, since LLVM 6012 currently has no 6013 address space on 6014 the fence need to 6015 conservatively 6016 always generate 6017 (see comment for 6018 previous fence). 6019 - Could be split into 6020 separate s_waitcnt 6021 vmcnt(0) and 6022 s_waitcnt 6023 lgkmcnt(0) to allow 6024 them to be 6025 independently moved 6026 according to the 6027 following rules. 
6028 - s_waitcnt vmcnt(0) 6029 must happen after 6030 any preceding 6031 global/generic 6032 load/store/load 6033 atomic/store 6034 atomic/atomicrmw. 6035 - s_waitcnt lgkmcnt(0) 6036 must happen after 6037 any preceding 6038 local/generic 6039 load/store/load 6040 atomic/store 6041 atomic/atomicrmw. 6042 - Must happen before 6043 the following 6044 buffer_wbinvl1_vol. 6045 - Ensures that the 6046 preceding 6047 global/local/generic 6048 load 6049 atomic/atomicrmw 6050 with an equal or 6051 wider sync scope 6052 and memory ordering 6053 stronger than 6054 unordered (this is 6055 termed the 6056 acquire-fence-paired-atomic) 6057 has completed 6058 before invalidating 6059 the cache. This 6060 satisfies the 6061 requirements of 6062 acquire. 6063 - Ensures that all 6064 previous memory 6065 operations have 6066 completed before a 6067 following 6068 global/local/generic 6069 store 6070 atomic/atomicrmw 6071 with an equal or 6072 wider sync scope 6073 and memory ordering 6074 stronger than 6075 unordered (this is 6076 termed the 6077 release-fence-paired-atomic). 6078 This satisfies the 6079 requirements of 6080 release. 6081 6082 2. buffer_wbinvl1_vol 6083 6084 - Must happen before 6085 any following 6086 global/generic 6087 load/load 6088 atomic/store/store 6089 atomic/atomicrmw. 6090 - Ensures that 6091 following loads 6092 will not see stale 6093 global data. This 6094 satisfies the 6095 requirements of 6096 acquire. 6097 6098 **Sequential Consistent Atomic** 6099 ------------------------------------------------------------------------------------ 6100 load atomic seq_cst - singlethread - global *Same as corresponding 6101 - wavefront - local load atomic acquire, 6102 - generic except must generate 6103 all instructions even 6104 for OpenCL.* 6105 load atomic seq_cst - workgroup - global 1. 
s_waitcnt lgkmcnt(0) 6106 - generic 6107 6108 - Must 6109 happen after 6110 preceding 6111 local/generic load 6112 atomic/store 6113 atomic/atomicrmw 6114 with memory 6115 ordering of seq_cst 6116 and with equal or 6117 wider sync scope. 6118 (Note that seq_cst 6119 fences have their 6120 own s_waitcnt 6121 lgkmcnt(0) and so do 6122 not need to be 6123 considered.) 6124 - Ensures any 6125 preceding 6126 sequential 6127 consistent local 6128 memory instructions 6129 have completed 6130 before executing 6131 this sequentially 6132 consistent 6133 instruction. This 6134 prevents reordering 6135 a seq_cst store 6136 followed by a 6137 seq_cst load. (Note 6138 that seq_cst is 6139 stronger than 6140 acquire/release as 6141 the reordering of 6142 load acquire 6143 followed by a store 6144 release is 6145 prevented by the 6146 s_waitcnt of 6147 the release, but 6148 there is nothing 6149 preventing a store 6150 release followed by 6151 load acquire from 6152 completing out of 6153 order. The s_waitcnt 6154 could be placed after 6155 seq_store or before 6156 the seq_load. We 6157 choose the load to 6158 make the s_waitcnt be 6159 as late as possible 6160 so that the store 6161 may have already 6162 completed.) 6163 6164 2. *Following 6165 instructions same as 6166 corresponding load 6167 atomic acquire, 6168 except must generate 6169 all instructions even 6170 for OpenCL.* 6171 load atomic seq_cst - workgroup - local *Same as corresponding 6172 load atomic acquire, 6173 except must generate 6174 all instructions even 6175 for OpenCL.* 6176 6177 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 6178 - system - generic vmcnt(0) 6179 6180 - Could be split into 6181 separate s_waitcnt 6182 vmcnt(0) 6183 and s_waitcnt 6184 lgkmcnt(0) to allow 6185 them to be 6186 independently moved 6187 according to the 6188 following rules. 
6189 - s_waitcnt lgkmcnt(0) 6190 must happen after 6191 preceding 6192 global/generic load 6193 atomic/store 6194 atomic/atomicrmw 6195 with memory 6196 ordering of seq_cst 6197 and with equal or 6198 wider sync scope. 6199 (Note that seq_cst 6200 fences have their 6201 own s_waitcnt 6202 lgkmcnt(0) and so do 6203 not need to be 6204 considered.) 6205 - s_waitcnt vmcnt(0) 6206 must happen after 6207 preceding 6208 global/generic load 6209 atomic/store 6210 atomic/atomicrmw 6211 with memory 6212 ordering of seq_cst 6213 and with equal or 6214 wider sync scope. 6215 (Note that seq_cst 6216 fences have their 6217 own s_waitcnt 6218 vmcnt(0) and so do 6219 not need to be 6220 considered.) 6221 - Ensures any 6222 preceding 6223 sequential 6224 consistent global 6225 memory instructions 6226 have completed 6227 before executing 6228 this sequentially 6229 consistent 6230 instruction. This 6231 prevents reordering 6232 a seq_cst store 6233 followed by a 6234 seq_cst load. (Note 6235 that seq_cst is 6236 stronger than 6237 acquire/release as 6238 the reordering of 6239 load acquire 6240 followed by a store 6241 release is 6242 prevented by the 6243 s_waitcnt of 6244 the release, but 6245 there is nothing 6246 preventing a store 6247 release followed by 6248 load acquire from 6249 completing out of 6250 order. The s_waitcnt 6251 could be placed after 6252 seq_store or before 6253 the seq_load. We 6254 choose the load to 6255 make the s_waitcnt be 6256 as late as possible 6257 so that the store 6258 may have already 6259 completed.) 6260 6261 2. 
*Following 6262 instructions same as 6263 corresponding load 6264 atomic acquire, 6265 except must generate 6266 all instructions even 6267 for OpenCL.* 6268 store atomic seq_cst - singlethread - global *Same as corresponding 6269 - wavefront - local store atomic release, 6270 - workgroup - generic except must generate 6271 - agent all instructions even 6272 - system for OpenCL.* 6273 atomicrmw seq_cst - singlethread - global *Same as corresponding 6274 - wavefront - local atomicrmw acq_rel, 6275 - workgroup - generic except must generate 6276 - agent all instructions even 6277 - system for OpenCL.* 6278 fence seq_cst - singlethread *none* *Same as corresponding 6279 - wavefront fence acq_rel, 6280 - workgroup except must generate 6281 - agent all instructions even 6282 - system for OpenCL.* 6283 ============ ============ ============== ========== ================================ 6284 6285.. _amdgpu-amdhsa-memory-model-gfx90a: 6286 6287Memory Model GFX90A 6288+++++++++++++++++++ 6289 6290For GFX90A: 6291 6292* Each agent has multiple shader arrays (SA). 6293* Each SA has multiple compute units (CU). 6294* Each CU has multiple SIMDs that execute wavefronts. 6295* The wavefronts for a single work-group are executed in the same CU but may be 6296 executed by different SIMDs. The exception is when in tgsplit execution mode 6297 when the wavefronts may be executed by different SIMDs in different CUs. 6298* Each CU has a single LDS memory shared by the wavefronts of the work-groups 6299 executing on it. The exception is when in tgsplit execution mode when no LDS 6300 is allocated as wavefronts of the same work-group can be in different CUs. 6301* All LDS operations of a CU are performed as wavefront wide operations in a 6302 global order and involve no caching. Completion is reported to a wavefront in 6303 execution order. 6304* The LDS memory has multiple request queues shared by the SIMDs of a 6305 CU. 
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they access
  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.

  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode as wavefronts of the same work-group can be in
    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
    the following item.

  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.

* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.

  * The L2 cache has independent channels to service disjoint ranges of virtual
    addresses.
  * Each CU has a separate request queue per channel. Therefore, the vector and
    scalar memory operations performed by wavefronts executing in different
    work-groups (which may be executing on different CUs), or the same
    work-group if executing in tgsplit mode, of an agent can be reordered
    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
    synchronization between vector memory operations of different CUs. It
    ensures a previous vector memory operation has completed before executing a
    subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by:
    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
      the cache probe caused by coherent requests. Coherent requests are caused
      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
      XGMI, and by PCIe requests that are configured to be coherent requests.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or write back
      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
    * To ensure coherence of local and remote memory writes of work-groups in
      different agents a ``buffer_wbl2`` is required.
      It will write back dirty L2
      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
      (used for remote coarse grain memory). Note that MTYPE CC (used for local
      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
      remote fine grain memory) bypasses the L2, so neither ever results in
      dirty L2 cache lines.
    * To ensure coherence of local and remote memory reads of work-groups in
      different agents a ``buffer_invl2`` is required. It will invalidate L2
      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
      coarse grain memory) cause local reads to be invalidated by remote writes
      with the PTE C-bit so these cache lines are not invalidated. Note that
      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
      never result in L2 cache lines that need to be invalidated.

  * PCIe access from the GPU to the CPU memory is kept coherent by using the
    MTYPE UC (uncached) which bypasses the L2.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time.
If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache-coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX90A
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX90A
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1
                                                           2. s_waitcnt vmcnt(0)

                                                              - Must happen before
                                                                any following volatile
                                                                global/generic
                                                                load/store.
                                                              - Ensures that
                                                                volatile
                                                                operations to
                                                                different
                                                                addresses will not
                                                                be reordered by
                                                                hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                           2. s_waitcnt vmcnt(0)

                                                              - Must happen before
                                                                any following volatile
                                                                global/generic
                                                                load/store.
                                                              - Ensures that
                                                                volatile
                                                                operations to
                                                                different
                                                                addresses will not
                                                                be reordered by
                                                                hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6491 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 6492 **Monotonic Atomic** 6493 ------------------------------------------------------------------------------------ 6494 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 6495 - wavefront - generic 6496 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 6497 - generic glc=1 6498 6499 - If not TgSplit execution 6500 mode, omit glc=1. 6501 6502 load atomic monotonic - singlethread - local *If TgSplit execution mode, 6503 - wavefront local address space cannot 6504 - workgroup be used.* 6505 6506 1. ds_load 6507 load atomic monotonic - agent - global 1. buffer/global/flat_load 6508 - generic glc=1 6509 load atomic monotonic - system - global 1. buffer/global/flat_load 6510 - generic glc=1 6511 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 6512 - wavefront - generic 6513 - workgroup 6514 - agent 6515 store atomic monotonic - system - global 1. buffer/global/flat_store 6516 - generic 6517 store atomic monotonic - singlethread - local *If TgSplit execution mode, 6518 - wavefront local address space cannot 6519 - workgroup be used.* 6520 6521 1. ds_store 6522 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 6523 - wavefront - generic 6524 - workgroup 6525 - agent 6526 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic 6527 - generic 6528 atomicrmw monotonic - singlethread - local *If TgSplit execution mode, 6529 - wavefront local address space cannot 6530 - workgroup be used.* 6531 6532 1. ds_atomic 6533 **Acquire Atomic** 6534 ------------------------------------------------------------------------------------ 6535 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 6536 - wavefront - local 6537 - generic 6538 load atomic acquire - workgroup - global 1. buffer/global_load glc=1 6539 6540 - If not TgSplit execution 6541 mode, omit glc=1. 6542 6543 2. 
s_waitcnt vmcnt(0) 6544 6545 - If not TgSplit execution 6546 mode, omit. 6547 - Must happen before the 6548 following buffer_wbinvl1_vol. 6549 6550 3. buffer_wbinvl1_vol 6551 6552 - If not TgSplit execution 6553 mode, omit. 6554 - Must happen before 6555 any following 6556 global/generic 6557 load/load 6558 atomic/store/store 6559 atomic/atomicrmw. 6560 - Ensures that 6561 following 6562 loads will not see 6563 stale data. 6564 6565 load atomic acquire - workgroup - local *If TgSplit execution mode, 6566 local address space cannot 6567 be used.* 6568 6569 1. ds_load 6570 2. s_waitcnt lgkmcnt(0) 6571 6572 - If OpenCL, omit. 6573 - Must happen before 6574 any following 6575 global/generic 6576 load/load 6577 atomic/store/store 6578 atomic/atomicrmw. 6579 - Ensures any 6580 following global 6581 data read is no 6582 older than the local load 6583 atomic value being 6584 acquired. 6585 6586 load atomic acquire - workgroup - generic 1. flat_load glc=1 6587 6588 - If not TgSplit execution 6589 mode, omit glc=1. 6590 6591 2. s_waitcnt lgkm/vmcnt(0) 6592 6593 - Use lgkmcnt(0) if not 6594 TgSplit execution mode 6595 and vmcnt(0) if TgSplit 6596 execution mode. 6597 - If OpenCL, omit lgkmcnt(0). 6598 - Must happen before 6599 the following 6600 buffer_wbinvl1_vol and any 6601 following global/generic 6602 load/load 6603 atomic/store/store 6604 atomic/atomicrmw. 6605 - Ensures any 6606 following global 6607 data read is no 6608 older than a local load 6609 atomic value being 6610 acquired. 6611 6612 3. buffer_wbinvl1_vol 6613 6614 - If not TgSplit execution 6615 mode, omit. 6616 - Ensures that 6617 following 6618 loads will not see 6619 stale data. 6620 6621 load atomic acquire - agent - global 1. buffer/global_load 6622 glc=1 6623 2. s_waitcnt vmcnt(0) 6624 6625 - Must happen before 6626 following 6627 buffer_wbinvl1_vol. 6628 - Ensures the load 6629 has completed 6630 before invalidating 6631 the cache. 6632 6633 3. 
buffer_wbinvl1_vol 6634 6635 - Must happen before 6636 any following 6637 global/generic 6638 load/load 6639 atomic/atomicrmw. 6640 - Ensures that 6641 following 6642 loads will not see 6643 stale global data. 6644 6645 load atomic acquire - system - global 1. buffer/global/flat_load 6646 glc=1 6647 2. s_waitcnt vmcnt(0) 6648 6649 - Must happen before 6650 following buffer_invl2 and 6651 buffer_wbinvl1_vol. 6652 - Ensures the load 6653 has completed 6654 before invalidating 6655 the cache. 6656 6657 3. buffer_invl2; 6658 buffer_wbinvl1_vol 6659 6660 - Must happen before 6661 any following 6662 global/generic 6663 load/load 6664 atomic/atomicrmw. 6665 - Ensures that 6666 following 6667 loads will not see 6668 stale L1 global data, 6669 nor see stale L2 MTYPE 6670 NC global data. 6671 MTYPE RW and CC memory will 6672 never be stale in L2 due to 6673 the memory probes. 6674 6675 load atomic acquire - agent - generic 1. flat_load glc=1 6676 2. s_waitcnt vmcnt(0) & 6677 lgkmcnt(0) 6678 6679 - If TgSplit execution mode, 6680 omit lgkmcnt(0). 6681 - If OpenCL omit 6682 lgkmcnt(0). 6683 - Must happen before 6684 following 6685 buffer_wbinvl1_vol. 6686 - Ensures the flat_load 6687 has completed 6688 before invalidating 6689 the cache. 6690 6691 3. buffer_wbinvl1_vol 6692 6693 - Must happen before 6694 any following 6695 global/generic 6696 load/load 6697 atomic/atomicrmw. 6698 - Ensures that 6699 following loads 6700 will not see stale 6701 global data. 6702 6703 load atomic acquire - system - generic 1. flat_load glc=1 6704 2. s_waitcnt vmcnt(0) & 6705 lgkmcnt(0) 6706 6707 - If TgSplit execution mode, 6708 omit lgkmcnt(0). 6709 - If OpenCL omit 6710 lgkmcnt(0). 6711 - Must happen before 6712 following 6713 buffer_invl2 and 6714 buffer_wbinvl1_vol. 6715 - Ensures the flat_load 6716 has completed 6717 before invalidating 6718 the caches. 6719 6720 3. 
buffer_invl2; 6721 buffer_wbinvl1_vol 6722 6723 - Must happen before 6724 any following 6725 global/generic 6726 load/load 6727 atomic/atomicrmw. 6728 - Ensures that 6729 following 6730 loads will not see 6731 stale L1 global data, 6732 nor see stale L2 MTYPE 6733 NC global data. 6734 MTYPE RW and CC memory will 6735 never be stale in L2 due to 6736 the memory probes. 6737 6738 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic 6739 - wavefront - generic 6740 atomicrmw acquire - singlethread - local *If TgSplit execution mode, 6741 - wavefront local address space cannot 6742 be used.* 6743 6744 1. ds_atomic 6745 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 6746 2. s_waitcnt vmcnt(0) 6747 6748 - If not TgSplit execution 6749 mode, omit. 6750 - Must happen before the 6751 following buffer_wbinvl1_vol. 6752 - Ensures the atomicrmw 6753 has completed 6754 before invalidating 6755 the cache. 6756 6757 3. buffer_wbinvl1_vol 6758 6759 - If not TgSplit execution 6760 mode, omit. 6761 - Must happen before 6762 any following 6763 global/generic 6764 load/load 6765 atomic/atomicrmw. 6766 - Ensures that 6767 following loads 6768 will not see stale 6769 global data. 6770 6771 atomicrmw acquire - workgroup - local *If TgSplit execution mode, 6772 local address space cannot 6773 be used.* 6774 6775 1. ds_atomic 6776 2. s_waitcnt lgkmcnt(0) 6777 6778 - If OpenCL, omit. 6779 - Must happen before 6780 any following 6781 global/generic 6782 load/load 6783 atomic/store/store 6784 atomic/atomicrmw. 6785 - Ensures any 6786 following global 6787 data read is no 6788 older than the local 6789 atomicrmw value 6790 being acquired. 6791 6792 atomicrmw acquire - workgroup - generic 1. flat_atomic 6793 2. s_waitcnt lgkm/vmcnt(0) 6794 6795 - Use lgkmcnt(0) if not 6796 TgSplit execution mode 6797 and vmcnt(0) if TgSplit 6798 execution mode. 6799 - If OpenCL, omit lgkmcnt(0). 
6800 - Must happen before 6801 the following 6802 buffer_wbinvl1_vol and 6803 any following 6804 global/generic 6805 load/load 6806 atomic/store/store 6807 atomic/atomicrmw. 6808 - Ensures any 6809 following global 6810 data read is no 6811 older than a local 6812 atomicrmw value 6813 being acquired. 6814 6815 3. buffer_wbinvl1_vol 6816 6817 - If not TgSplit execution 6818 mode, omit. 6819 - Ensures that 6820 following 6821 loads will not see 6822 stale data. 6823 6824 atomicrmw acquire - agent - global 1. buffer/global_atomic 6825 2. s_waitcnt vmcnt(0) 6826 6827 - Must happen before 6828 following 6829 buffer_wbinvl1_vol. 6830 - Ensures the 6831 atomicrmw has 6832 completed before 6833 invalidating the 6834 cache. 6835 6836 3. buffer_wbinvl1_vol 6837 6838 - Must happen before 6839 any following 6840 global/generic 6841 load/load 6842 atomic/atomicrmw. 6843 - Ensures that 6844 following loads 6845 will not see stale 6846 global data. 6847 6848 atomicrmw acquire - system - global 1. buffer/global_atomic 6849 2. s_waitcnt vmcnt(0) 6850 6851 - Must happen before 6852 following buffer_invl2 and 6853 buffer_wbinvl1_vol. 6854 - Ensures the 6855 atomicrmw has 6856 completed before 6857 invalidating the 6858 caches. 6859 6860 3. buffer_invl2; 6861 buffer_wbinvl1_vol 6862 6863 - Must happen before 6864 any following 6865 global/generic 6866 load/load 6867 atomic/atomicrmw. 6868 - Ensures that 6869 following 6870 loads will not see 6871 stale L1 global data, 6872 nor see stale L2 MTYPE 6873 NC global data. 6874 MTYPE RW and CC memory will 6875 never be stale in L2 due to 6876 the memory probes. 6877 6878 atomicrmw acquire - agent - generic 1. flat_atomic 6879 2. s_waitcnt vmcnt(0) & 6880 lgkmcnt(0) 6881 6882 - If TgSplit execution mode, 6883 omit lgkmcnt(0). 6884 - If OpenCL, omit 6885 lgkmcnt(0). 6886 - Must happen before 6887 following 6888 buffer_wbinvl1_vol. 6889 - Ensures the 6890 atomicrmw has 6891 completed before 6892 invalidating the 6893 cache. 6894 6895 3. 
buffer_wbinvl1_vol 6896 6897 - Must happen before 6898 any following 6899 global/generic 6900 load/load 6901 atomic/atomicrmw. 6902 - Ensures that 6903 following loads 6904 will not see stale 6905 global data. 6906 6907 atomicrmw acquire - system - generic 1. flat_atomic 6908 2. s_waitcnt vmcnt(0) & 6909 lgkmcnt(0) 6910 6911 - If TgSplit execution mode, 6912 omit lgkmcnt(0). 6913 - If OpenCL, omit 6914 lgkmcnt(0). 6915 - Must happen before 6916 following 6917 buffer_invl2 and 6918 buffer_wbinvl1_vol. 6919 - Ensures the 6920 atomicrmw has 6921 completed before 6922 invalidating the 6923 caches. 6924 6925 3. buffer_invl2; 6926 buffer_wbinvl1_vol 6927 6928 - Must happen before 6929 any following 6930 global/generic 6931 load/load 6932 atomic/atomicrmw. 6933 - Ensures that 6934 following 6935 loads will not see 6936 stale L1 global data, 6937 nor see stale L2 MTYPE 6938 NC global data. 6939 MTYPE RW and CC memory will 6940 never be stale in L2 due to 6941 the memory probes. 6942 6943 fence acquire - singlethread *none* *none* 6944 - wavefront 6945 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 6946 6947 - Use lgkmcnt(0) if not 6948 TgSplit execution mode 6949 and vmcnt(0) if TgSplit 6950 execution mode. 6951 - If OpenCL and 6952 address space is 6953 not generic, omit 6954 lgkmcnt(0). 6955 - If OpenCL and 6956 address space is 6957 local, omit 6958 vmcnt(0). 6959 - However, since LLVM 6960 currently has no 6961 address space on 6962 the fence need to 6963 conservatively 6964 always generate. If 6965 fence had an 6966 address space then 6967 set to address 6968 space of OpenCL 6969 fence flag, or to 6970 generic if both 6971 local and global 6972 flags are 6973 specified. 6974 - s_waitcnt vmcnt(0) 6975 must happen after 6976 any preceding 6977 global/generic load 6978 atomic/ 6979 atomicrmw 6980 with an equal or 6981 wider sync scope 6982 and memory ordering 6983 stronger than 6984 unordered (this is 6985 termed the 6986 fence-paired-atomic). 
6987 - s_waitcnt lgkmcnt(0) 6988 must happen after 6989 any preceding 6990 local/generic load 6991 atomic/atomicrmw 6992 with an equal or 6993 wider sync scope 6994 and memory ordering 6995 stronger than 6996 unordered (this is 6997 termed the 6998 fence-paired-atomic). 6999 - Must happen before 7000 the following 7001 buffer_wbinvl1_vol and 7002 any following 7003 global/generic 7004 load/load 7005 atomic/store/store 7006 atomic/atomicrmw. 7007 - Ensures any 7008 following global 7009 data read is no 7010 older than the 7011 value read by the 7012 fence-paired-atomic. 7013 7014 2. buffer_wbinvl1_vol 7015 7016 - If not TgSplit execution 7017 mode, omit. 7018 - Ensures that 7019 following 7020 loads will not see 7021 stale data. 7022 7023 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 7024 vmcnt(0) 7025 7026 - If TgSplit execution mode, 7027 omit lgkmcnt(0). 7028 - If OpenCL and 7029 address space is 7030 not generic, omit 7031 lgkmcnt(0). 7032 - However, since LLVM 7033 currently has no 7034 address space on 7035 the fence need to 7036 conservatively 7037 always generate 7038 (see comment for 7039 previous fence). 7040 - Could be split into 7041 separate s_waitcnt 7042 vmcnt(0) and 7043 s_waitcnt 7044 lgkmcnt(0) to allow 7045 them to be 7046 independently moved 7047 according to the 7048 following rules. 7049 - s_waitcnt vmcnt(0) 7050 must happen after 7051 any preceding 7052 global/generic load 7053 atomic/atomicrmw 7054 with an equal or 7055 wider sync scope 7056 and memory ordering 7057 stronger than 7058 unordered (this is 7059 termed the 7060 fence-paired-atomic). 7061 - s_waitcnt lgkmcnt(0) 7062 must happen after 7063 any preceding 7064 local/generic load 7065 atomic/atomicrmw 7066 with an equal or 7067 wider sync scope 7068 and memory ordering 7069 stronger than 7070 unordered (this is 7071 termed the 7072 fence-paired-atomic). 7073 - Must happen before 7074 the following 7075 buffer_wbinvl1_vol. 
7076 - Ensures that the 7077 fence-paired atomic 7078 has completed 7079 before invalidating 7080 the 7081 cache. Therefore 7082 any following 7083 locations read must 7084 be no older than 7085 the value read by 7086 the 7087 fence-paired-atomic. 7088 7089 2. buffer_wbinvl1_vol 7090 7091 - Must happen before any 7092 following global/generic 7093 load/load 7094 atomic/store/store 7095 atomic/atomicrmw. 7096 - Ensures that 7097 following loads 7098 will not see stale 7099 global data. 7100 7101 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) & 7102 vmcnt(0) 7103 7104 - If TgSplit execution mode, 7105 omit lgkmcnt(0). 7106 - If OpenCL and 7107 address space is 7108 not generic, omit 7109 lgkmcnt(0). 7110 - However, since LLVM 7111 currently has no 7112 address space on 7113 the fence need to 7114 conservatively 7115 always generate 7116 (see comment for 7117 previous fence). 7118 - Could be split into 7119 separate s_waitcnt 7120 vmcnt(0) and 7121 s_waitcnt 7122 lgkmcnt(0) to allow 7123 them to be 7124 independently moved 7125 according to the 7126 following rules. 7127 - s_waitcnt vmcnt(0) 7128 must happen after 7129 any preceding 7130 global/generic load 7131 atomic/atomicrmw 7132 with an equal or 7133 wider sync scope 7134 and memory ordering 7135 stronger than 7136 unordered (this is 7137 termed the 7138 fence-paired-atomic). 7139 - s_waitcnt lgkmcnt(0) 7140 must happen after 7141 any preceding 7142 local/generic load 7143 atomic/atomicrmw 7144 with an equal or 7145 wider sync scope 7146 and memory ordering 7147 stronger than 7148 unordered (this is 7149 termed the 7150 fence-paired-atomic). 7151 - Must happen before 7152 the following buffer_invl2 and 7153 buffer_wbinvl1_vol. 7154 - Ensures that the 7155 fence-paired atomic 7156 has completed 7157 before invalidating 7158 the 7159 cache. Therefore 7160 any following 7161 locations read must 7162 be no older than 7163 the value read by 7164 the 7165 fence-paired-atomic. 7166 7167 2. 
buffer_invl2; 7168 buffer_wbinvl1_vol 7169 7170 - Must happen before any 7171 following global/generic 7172 load/load 7173 atomic/store/store 7174 atomic/atomicrmw. 7175 - Ensures that 7176 following 7177 loads will not see 7178 stale L1 global data, 7179 nor see stale L2 MTYPE 7180 NC global data. 7181 MTYPE RW and CC memory will 7182 never be stale in L2 due to 7183 the memory probes. 7184 **Release Atomic** 7185 ------------------------------------------------------------------------------------ 7186 store atomic release - singlethread - global 1. buffer/global/flat_store 7187 - wavefront - generic 7188 store atomic release - singlethread - local *If TgSplit execution mode, 7189 - wavefront local address space cannot 7190 be used.* 7191 7192 1. ds_store 7193 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7194 - generic 7195 - Use lgkmcnt(0) if not 7196 TgSplit execution mode 7197 and vmcnt(0) if TgSplit 7198 execution mode. 7199 - If OpenCL, omit lgkmcnt(0). 7200 - s_waitcnt vmcnt(0) 7201 must happen after 7202 any preceding 7203 global/generic load/store/ 7204 load atomic/store atomic/ 7205 atomicrmw. 7206 - s_waitcnt lgkmcnt(0) 7207 must happen after 7208 any preceding 7209 local/generic 7210 load/store/load 7211 atomic/store 7212 atomic/atomicrmw. 7213 - Must happen before 7214 the following 7215 store. 7216 - Ensures that all 7217 memory operations 7218 have 7219 completed before 7220 performing the 7221 store that is being 7222 released. 7223 7224 2. buffer/global/flat_store 7225 store atomic release - workgroup - local *If TgSplit execution mode, 7226 local address space cannot 7227 be used.* 7228 7229 1. ds_store 7230 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 7231 - generic vmcnt(0) 7232 7233 - If TgSplit execution mode, 7234 omit lgkmcnt(0). 7235 - If OpenCL and 7236 address space is 7237 not generic, omit 7238 lgkmcnt(0). 
7239 - Could be split into 7240 separate s_waitcnt 7241 vmcnt(0) and 7242 s_waitcnt 7243 lgkmcnt(0) to allow 7244 them to be 7245 independently moved 7246 according to the 7247 following rules. 7248 - s_waitcnt vmcnt(0) 7249 must happen after 7250 any preceding 7251 global/generic 7252 load/store/load 7253 atomic/store 7254 atomic/atomicrmw. 7255 - s_waitcnt lgkmcnt(0) 7256 must happen after 7257 any preceding 7258 local/generic 7259 load/store/load 7260 atomic/store 7261 atomic/atomicrmw. 7262 - Must happen before 7263 the following 7264 store. 7265 - Ensures that all 7266 memory operations 7267 to memory have 7268 completed before 7269 performing the 7270 store that is being 7271 released. 7272 7273 2. buffer/global/flat_store 7274 store atomic release - system - global 1. buffer_wbl2 7275 - generic 7276 - Must happen before 7277 following s_waitcnt. 7278 - Performs L2 writeback to 7279 ensure previous 7280 global/generic 7281 store/atomicrmw are 7282 visible at system scope. 7283 7284 2. s_waitcnt lgkmcnt(0) & 7285 vmcnt(0) 7286 7287 - If TgSplit execution mode, 7288 omit lgkmcnt(0). 7289 - If OpenCL and 7290 address space is 7291 not generic, omit 7292 lgkmcnt(0). 7293 - Could be split into 7294 separate s_waitcnt 7295 vmcnt(0) and 7296 s_waitcnt 7297 lgkmcnt(0) to allow 7298 them to be 7299 independently moved 7300 according to the 7301 following rules. 7302 - s_waitcnt vmcnt(0) 7303 must happen after any 7304 preceding 7305 global/generic 7306 load/store/load 7307 atomic/store 7308 atomic/atomicrmw. 7309 - s_waitcnt lgkmcnt(0) 7310 must happen after any 7311 preceding 7312 local/generic 7313 load/store/load 7314 atomic/store 7315 atomic/atomicrmw. 7316 - Must happen before 7317 the following 7318 store. 7319 - Ensures that all 7320 memory operations 7321 to memory and the L2 7322 writeback have 7323 completed before 7324 performing the 7325 store that is being 7326 released. 7327 7328 3. 
buffer/global/flat_store 7329 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic 7330 - wavefront - generic 7331 atomicrmw release - singlethread - local *If TgSplit execution mode, 7332 - wavefront local address space cannot 7333 be used.* 7334 7335 1. ds_atomic 7336 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7337 - generic 7338 - Use lgkmcnt(0) if not 7339 TgSplit execution mode 7340 and vmcnt(0) if TgSplit 7341 execution mode. 7342 - If OpenCL, omit 7343 lgkmcnt(0). 7344 - s_waitcnt vmcnt(0) 7345 must happen after 7346 any preceding 7347 global/generic load/store/ 7348 load atomic/store atomic/ 7349 atomicrmw. 7350 - s_waitcnt lgkmcnt(0) 7351 must happen after 7352 any preceding 7353 local/generic 7354 load/store/load 7355 atomic/store 7356 atomic/atomicrmw. 7357 - Must happen before 7358 the following 7359 atomicrmw. 7360 - Ensures that all 7361 memory operations 7362 have 7363 completed before 7364 performing the 7365 atomicrmw that is 7366 being released. 7367 7368 2. buffer/global/flat_atomic 7369 atomicrmw release - workgroup - local *If TgSplit execution mode, 7370 local address space cannot 7371 be used.* 7372 7373 1. ds_atomic 7374 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 7375 - generic vmcnt(0) 7376 7377 - If TgSplit execution mode, 7378 omit lgkmcnt(0). 7379 - If OpenCL, omit 7380 lgkmcnt(0). 7381 - Could be split into 7382 separate s_waitcnt 7383 vmcnt(0) and 7384 s_waitcnt 7385 lgkmcnt(0) to allow 7386 them to be 7387 independently moved 7388 according to the 7389 following rules. 7390 - s_waitcnt vmcnt(0) 7391 must happen after 7392 any preceding 7393 global/generic 7394 load/store/load 7395 atomic/store 7396 atomic/atomicrmw. 7397 - s_waitcnt lgkmcnt(0) 7398 must happen after 7399 any preceding 7400 local/generic 7401 load/store/load 7402 atomic/store 7403 atomic/atomicrmw. 7404 - Must happen before 7405 the following 7406 atomicrmw. 
                                                           - Ensures that all
                                                             memory operations
                                                             to global and local
                                                             have completed
                                                             before performing
                                                             the atomicrmw that
                                                             is being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - system       - global   1. buffer_wbl2
                                              - generic
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to memory and the L2
                                                             writeback have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         3. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate.
If 7491 fence had an 7492 address space then 7493 set to address 7494 space of OpenCL 7495 fence flag, or to 7496 generic if both 7497 local and global 7498 flags are 7499 specified. 7500 - s_waitcnt vmcnt(0) 7501 must happen after 7502 any preceding 7503 global/generic 7504 load/store/ 7505 load atomic/store atomic/ 7506 atomicrmw. 7507 - s_waitcnt lgkmcnt(0) 7508 must happen after 7509 any preceding 7510 local/generic 7511 load/load 7512 atomic/store/store 7513 atomic/atomicrmw. 7514 - Must happen before 7515 any following store 7516 atomic/atomicrmw 7517 with an equal or 7518 wider sync scope 7519 and memory ordering 7520 stronger than 7521 unordered (this is 7522 termed the 7523 fence-paired-atomic). 7524 - Ensures that all 7525 memory operations 7526 have 7527 completed before 7528 performing the 7529 following 7530 fence-paired-atomic. 7531 7532 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 7533 vmcnt(0) 7534 7535 - If TgSplit execution mode, 7536 omit lgkmcnt(0). 7537 - If OpenCL and 7538 address space is 7539 not generic, omit 7540 lgkmcnt(0). 7541 - If OpenCL and 7542 address space is 7543 local, omit 7544 vmcnt(0). 7545 - However, since LLVM 7546 currently has no 7547 address space on 7548 the fence need to 7549 conservatively 7550 always generate. If 7551 fence had an 7552 address space then 7553 set to address 7554 space of OpenCL 7555 fence flag, or to 7556 generic if both 7557 local and global 7558 flags are 7559 specified. 7560 - Could be split into 7561 separate s_waitcnt 7562 vmcnt(0) and 7563 s_waitcnt 7564 lgkmcnt(0) to allow 7565 them to be 7566 independently moved 7567 according to the 7568 following rules. 7569 - s_waitcnt vmcnt(0) 7570 must happen after 7571 any preceding 7572 global/generic 7573 load/store/load 7574 atomic/store 7575 atomic/atomicrmw. 7576 - s_waitcnt lgkmcnt(0) 7577 must happen after 7578 any preceding 7579 local/generic 7580 load/store/load 7581 atomic/store 7582 atomic/atomicrmw. 
7583 - Must happen before 7584 any following store 7585 atomic/atomicrmw 7586 with an equal or 7587 wider sync scope 7588 and memory ordering 7589 stronger than 7590 unordered (this is 7591 termed the 7592 fence-paired-atomic). 7593 - Ensures that all 7594 memory operations 7595 have 7596 completed before 7597 performing the 7598 following 7599 fence-paired-atomic. 7600 7601 fence release - system *none* 1. buffer_wbl2 7602 7603 - If OpenCL and 7604 address space is 7605 local, omit. 7606 - Must happen before 7607 following s_waitcnt. 7608 - Performs L2 writeback to 7609 ensure previous 7610 global/generic 7611 store/atomicrmw are 7612 visible at system scope. 7613 7614 2. s_waitcnt lgkmcnt(0) & 7615 vmcnt(0) 7616 7617 - If TgSplit execution mode, 7618 omit lgkmcnt(0). 7619 - If OpenCL and 7620 address space is 7621 not generic, omit 7622 lgkmcnt(0). 7623 - If OpenCL and 7624 address space is 7625 local, omit 7626 vmcnt(0). 7627 - However, since LLVM 7628 currently has no 7629 address space on 7630 the fence need to 7631 conservatively 7632 always generate. If 7633 fence had an 7634 address space then 7635 set to address 7636 space of OpenCL 7637 fence flag, or to 7638 generic if both 7639 local and global 7640 flags are 7641 specified. 7642 - Could be split into 7643 separate s_waitcnt 7644 vmcnt(0) and 7645 s_waitcnt 7646 lgkmcnt(0) to allow 7647 them to be 7648 independently moved 7649 according to the 7650 following rules. 7651 - s_waitcnt vmcnt(0) 7652 must happen after 7653 any preceding 7654 global/generic 7655 load/store/load 7656 atomic/store 7657 atomic/atomicrmw. 7658 - s_waitcnt lgkmcnt(0) 7659 must happen after 7660 any preceding 7661 local/generic 7662 load/store/load 7663 atomic/store 7664 atomic/atomicrmw. 7665 - Must happen before 7666 any following store 7667 atomic/atomicrmw 7668 with an equal or 7669 wider sync scope 7670 and memory ordering 7671 stronger than 7672 unordered (this is 7673 termed the 7674 fence-paired-atomic). 
7675 - Ensures that all 7676 memory operations 7677 have 7678 completed before 7679 performing the 7680 following 7681 fence-paired-atomic. 7682 7683 **Acquire-Release Atomic** 7684 ------------------------------------------------------------------------------------ 7685 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic 7686 - wavefront - generic 7687 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode, 7688 - wavefront local address space cannot 7689 be used.* 7690 7691 1. ds_atomic 7692 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7693 7694 - Use lgkmcnt(0) if not 7695 TgSplit execution mode 7696 and vmcnt(0) if TgSplit 7697 execution mode. 7698 - If OpenCL, omit 7699 lgkmcnt(0). 7700 - Must happen after 7701 any preceding 7702 local/generic 7703 load/store/load 7704 atomic/store 7705 atomic/atomicrmw. 7706 - s_waitcnt vmcnt(0) 7707 must happen after 7708 any preceding 7709 global/generic load/store/ 7710 load atomic/store atomic/ 7711 atomicrmw. 7712 - s_waitcnt lgkmcnt(0) 7713 must happen after 7714 any preceding 7715 local/generic 7716 load/store/load 7717 atomic/store 7718 atomic/atomicrmw. 7719 - Must happen before 7720 the following 7721 atomicrmw. 7722 - Ensures that all 7723 memory operations 7724 have 7725 completed before 7726 performing the 7727 atomicrmw that is 7728 being released. 7729 7730 2. buffer/global_atomic 7731 3. s_waitcnt vmcnt(0) 7732 7733 - If not TgSplit execution 7734 mode, omit. 7735 - Must happen before 7736 the following 7737 buffer_wbinvl1_vol. 7738 - Ensures any 7739 following global 7740 data read is no 7741 older than the 7742 atomicrmw value 7743 being acquired. 7744 7745 4. buffer_wbinvl1_vol 7746 7747 - If not TgSplit execution 7748 mode, omit. 7749 - Ensures that 7750 following 7751 loads will not see 7752 stale data. 7753 7754 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode, 7755 local address space cannot 7756 be used.* 7757 7758 1. 
                                                            ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local
                                                             atomicrmw value being
                                                             acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/store/
                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_wbinvl1_vol and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local
                                                             atomicrmw value being
                                                             acquired.

                                                         4. buffer_wbinvl1_vol

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
7846 - Could be split into 7847 separate s_waitcnt 7848 vmcnt(0) and 7849 s_waitcnt 7850 lgkmcnt(0) to allow 7851 them to be 7852 independently moved 7853 according to the 7854 following rules. 7855 - s_waitcnt vmcnt(0) 7856 must happen after 7857 any preceding 7858 global/generic 7859 load/store/load 7860 atomic/store 7861 atomic/atomicrmw. 7862 - s_waitcnt lgkmcnt(0) 7863 must happen after 7864 any preceding 7865 local/generic 7866 load/store/load 7867 atomic/store 7868 atomic/atomicrmw. 7869 - Must happen before 7870 the following 7871 atomicrmw. 7872 - Ensures that all 7873 memory operations 7874 to global have 7875 completed before 7876 performing the 7877 atomicrmw that is 7878 being released. 7879 7880 2. buffer/global_atomic 7881 3. s_waitcnt vmcnt(0) 7882 7883 - Must happen before 7884 following 7885 buffer_wbinvl1_vol. 7886 - Ensures the 7887 atomicrmw has 7888 completed before 7889 invalidating the 7890 cache. 7891 7892 4. buffer_wbinvl1_vol 7893 7894 - Must happen before 7895 any following 7896 global/generic 7897 load/load 7898 atomic/atomicrmw. 7899 - Ensures that 7900 following loads 7901 will not see stale 7902 global data. 7903 7904 atomicrmw acq_rel - system - global 1. buffer_wbl2 7905 7906 - Must happen before 7907 following s_waitcnt. 7908 - Performs L2 writeback to 7909 ensure previous 7910 global/generic 7911 store/atomicrmw are 7912 visible at system scope. 7913 7914 2. s_waitcnt lgkmcnt(0) & 7915 vmcnt(0) 7916 7917 - If TgSplit execution mode, 7918 omit lgkmcnt(0). 7919 - If OpenCL, omit 7920 lgkmcnt(0). 7921 - Could be split into 7922 separate s_waitcnt 7923 vmcnt(0) and 7924 s_waitcnt 7925 lgkmcnt(0) to allow 7926 them to be 7927 independently moved 7928 according to the 7929 following rules. 7930 - s_waitcnt vmcnt(0) 7931 must happen after 7932 any preceding 7933 global/generic 7934 load/store/load 7935 atomic/store 7936 atomic/atomicrmw. 
7937 - s_waitcnt lgkmcnt(0) 7938 must happen after 7939 any preceding 7940 local/generic 7941 load/store/load 7942 atomic/store 7943 atomic/atomicrmw. 7944 - Must happen before 7945 the following 7946 atomicrmw. 7947 - Ensures that all 7948 memory operations 7949 to global and L2 writeback 7950 have completed before 7951 performing the 7952 atomicrmw that is 7953 being released. 7954 7955 3. buffer/global_atomic 7956 4. s_waitcnt vmcnt(0) 7957 7958 - Must happen before 7959 following buffer_invl2 and 7960 buffer_wbinvl1_vol. 7961 - Ensures the 7962 atomicrmw has 7963 completed before 7964 invalidating the 7965 caches. 7966 7967 5. buffer_invl2; 7968 buffer_wbinvl1_vol 7969 7970 - Must happen before 7971 any following 7972 global/generic 7973 load/load 7974 atomic/atomicrmw. 7975 - Ensures that 7976 following 7977 loads will not see 7978 stale L1 global data, 7979 nor see stale L2 MTYPE 7980 NC global data. 7981 MTYPE RW and CC memory will 7982 never be stale in L2 due to 7983 the memory probes. 7984 7985 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 7986 vmcnt(0) 7987 7988 - If TgSplit execution mode, 7989 omit lgkmcnt(0). 7990 - If OpenCL, omit 7991 lgkmcnt(0). 7992 - Could be split into 7993 separate s_waitcnt 7994 vmcnt(0) and 7995 s_waitcnt 7996 lgkmcnt(0) to allow 7997 them to be 7998 independently moved 7999 according to the 8000 following rules. 8001 - s_waitcnt vmcnt(0) 8002 must happen after 8003 any preceding 8004 global/generic 8005 load/store/load 8006 atomic/store 8007 atomic/atomicrmw. 8008 - s_waitcnt lgkmcnt(0) 8009 must happen after 8010 any preceding 8011 local/generic 8012 load/store/load 8013 atomic/store 8014 atomic/atomicrmw. 8015 - Must happen before 8016 the following 8017 atomicrmw. 8018 - Ensures that all 8019 memory operations 8020 to global have 8021 completed before 8022 performing the 8023 atomicrmw that is 8024 being released. 8025 8026 2. flat_atomic 8027 3. 
s_waitcnt vmcnt(0) & 8028 lgkmcnt(0) 8029 8030 - If TgSplit execution mode, 8031 omit lgkmcnt(0). 8032 - If OpenCL, omit 8033 lgkmcnt(0). 8034 - Must happen before 8035 following 8036 buffer_wbinvl1_vol. 8037 - Ensures the 8038 atomicrmw has 8039 completed before 8040 invalidating the 8041 cache. 8042 8043 4. buffer_wbinvl1_vol 8044 8045 - Must happen before 8046 any following 8047 global/generic 8048 load/load 8049 atomic/atomicrmw. 8050 - Ensures that 8051 following loads 8052 will not see stale 8053 global data. 8054 8055 atomicrmw acq_rel - system - generic 1. buffer_wbl2 8056 8057 - Must happen before 8058 following s_waitcnt. 8059 - Performs L2 writeback to 8060 ensure previous 8061 global/generic 8062 store/atomicrmw are 8063 visible at system scope. 8064 8065 2. s_waitcnt lgkmcnt(0) & 8066 vmcnt(0) 8067 8068 - If TgSplit execution mode, 8069 omit lgkmcnt(0). 8070 - If OpenCL, omit 8071 lgkmcnt(0). 8072 - Could be split into 8073 separate s_waitcnt 8074 vmcnt(0) and 8075 s_waitcnt 8076 lgkmcnt(0) to allow 8077 them to be 8078 independently moved 8079 according to the 8080 following rules. 8081 - s_waitcnt vmcnt(0) 8082 must happen after 8083 any preceding 8084 global/generic 8085 load/store/load 8086 atomic/store 8087 atomic/atomicrmw. 8088 - s_waitcnt lgkmcnt(0) 8089 must happen after 8090 any preceding 8091 local/generic 8092 load/store/load 8093 atomic/store 8094 atomic/atomicrmw. 8095 - Must happen before 8096 the following 8097 atomicrmw. 8098 - Ensures that all 8099 memory operations 8100 to global and L2 writeback 8101 have completed before 8102 performing the 8103 atomicrmw that is 8104 being released. 8105 8106 3. flat_atomic 8107 4. s_waitcnt vmcnt(0) & 8108 lgkmcnt(0) 8109 8110 - If TgSplit execution mode, 8111 omit lgkmcnt(0). 8112 - If OpenCL, omit 8113 lgkmcnt(0). 8114 - Must happen before 8115 following buffer_invl2 and 8116 buffer_wbinvl1_vol. 8117 - Ensures the 8118 atomicrmw has 8119 completed before 8120 invalidating the 8121 caches. 
8122 8123 5. buffer_invl2; 8124 buffer_wbinvl1_vol 8125 8126 - Must happen before 8127 any following 8128 global/generic 8129 load/load 8130 atomic/atomicrmw. 8131 - Ensures that 8132 following 8133 loads will not see 8134 stale L1 global data, 8135 nor see stale L2 MTYPE 8136 NC global data. 8137 MTYPE RW and CC memory will 8138 never be stale in L2 due to 8139 the memory probes. 8140 8141 fence acq_rel - singlethread *none* *none* 8142 - wavefront 8143 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 8144 8145 - Use lgkmcnt(0) if not 8146 TgSplit execution mode 8147 and vmcnt(0) if TgSplit 8148 execution mode. 8149 - If OpenCL and 8150 address space is 8151 not generic, omit 8152 lgkmcnt(0). 8153 - If OpenCL and 8154 address space is 8155 local, omit 8156 vmcnt(0). 8157 - However, 8158 since LLVM 8159 currently has no 8160 address space on 8161 the fence need to 8162 conservatively 8163 always generate 8164 (see comment for 8165 previous fence). 8166 - s_waitcnt vmcnt(0) 8167 must happen after 8168 any preceding 8169 global/generic 8170 load/store/ 8171 load atomic/store atomic/ 8172 atomicrmw. 8173 - s_waitcnt lgkmcnt(0) 8174 must happen after 8175 any preceding 8176 local/generic 8177 load/load 8178 atomic/store/store 8179 atomic/atomicrmw. 8180 - Must happen before 8181 any following 8182 global/generic 8183 load/load 8184 atomic/store/store 8185 atomic/atomicrmw. 8186 - Ensures that all 8187 memory operations 8188 have 8189 completed before 8190 performing any 8191 following global 8192 memory operations. 8193 - Ensures that the 8194 preceding 8195 local/generic load 8196 atomic/atomicrmw 8197 with an equal or 8198 wider sync scope 8199 and memory ordering 8200 stronger than 8201 unordered (this is 8202 termed the 8203 acquire-fence-paired-atomic) 8204 has completed 8205 before following 8206 global memory 8207 operations. This 8208 satisfies the 8209 requirements of 8210 acquire. 
8211 - Ensures that all 8212 previous memory 8213 operations have 8214 completed before a 8215 following 8216 local/generic store 8217 atomic/atomicrmw 8218 with an equal or 8219 wider sync scope 8220 and memory ordering 8221 stronger than 8222 unordered (this is 8223 termed the 8224 release-fence-paired-atomic). 8225 This satisfies the 8226 requirements of 8227 release. 8228 - Must happen before 8229 the following 8230 buffer_wbinvl1_vol. 8231 - Ensures that the 8232 acquire-fence-paired 8233 atomic has completed 8234 before invalidating 8235 the 8236 cache. Therefore 8237 any following 8238 locations read must 8239 be no older than 8240 the value read by 8241 the 8242 acquire-fence-paired-atomic. 8243 8244 2. buffer_wbinvl1_vol 8245 8246 - If not TgSplit execution 8247 mode, omit. 8248 - Ensures that 8249 following 8250 loads will not see 8251 stale data. 8252 8253 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 8254 vmcnt(0) 8255 8256 - If TgSplit execution mode, 8257 omit lgkmcnt(0). 8258 - If OpenCL and 8259 address space is 8260 not generic, omit 8261 lgkmcnt(0). 8262 - However, since LLVM 8263 currently has no 8264 address space on 8265 the fence need to 8266 conservatively 8267 always generate 8268 (see comment for 8269 previous fence). 8270 - Could be split into 8271 separate s_waitcnt 8272 vmcnt(0) and 8273 s_waitcnt 8274 lgkmcnt(0) to allow 8275 them to be 8276 independently moved 8277 according to the 8278 following rules. 8279 - s_waitcnt vmcnt(0) 8280 must happen after 8281 any preceding 8282 global/generic 8283 load/store/load 8284 atomic/store 8285 atomic/atomicrmw. 8286 - s_waitcnt lgkmcnt(0) 8287 must happen after 8288 any preceding 8289 local/generic 8290 load/store/load 8291 atomic/store 8292 atomic/atomicrmw. 8293 - Must happen before 8294 the following 8295 buffer_wbinvl1_vol. 
8296 - Ensures that the 8297 preceding 8298 global/local/generic 8299 load 8300 atomic/atomicrmw 8301 with an equal or 8302 wider sync scope 8303 and memory ordering 8304 stronger than 8305 unordered (this is 8306 termed the 8307 acquire-fence-paired-atomic) 8308 has completed 8309 before invalidating 8310 the cache. This 8311 satisfies the 8312 requirements of 8313 acquire. 8314 - Ensures that all 8315 previous memory 8316 operations have 8317 completed before a 8318 following 8319 global/local/generic 8320 store 8321 atomic/atomicrmw 8322 with an equal or 8323 wider sync scope 8324 and memory ordering 8325 stronger than 8326 unordered (this is 8327 termed the 8328 release-fence-paired-atomic). 8329 This satisfies the 8330 requirements of 8331 release. 8332 8333 2. buffer_wbinvl1_vol 8334 8335 - Must happen before 8336 any following 8337 global/generic 8338 load/load 8339 atomic/store/store 8340 atomic/atomicrmw. 8341 - Ensures that 8342 following loads 8343 will not see stale 8344 global data. This 8345 satisfies the 8346 requirements of 8347 acquire. 8348 8349 fence acq_rel - system *none* 1. buffer_wbl2 8350 8351 - If OpenCL and 8352 address space is 8353 local, omit. 8354 - Must happen before 8355 following s_waitcnt. 8356 - Performs L2 writeback to 8357 ensure previous 8358 global/generic 8359 store/atomicrmw are 8360 visible at system scope. 8361 8362 2. s_waitcnt lgkmcnt(0) & 8363 vmcnt(0) 8364 8365 - If TgSplit execution mode, 8366 omit lgkmcnt(0). 8367 - If OpenCL and 8368 address space is 8369 not generic, omit 8370 lgkmcnt(0). 8371 - However, since LLVM 8372 currently has no 8373 address space on 8374 the fence need to 8375 conservatively 8376 always generate 8377 (see comment for 8378 previous fence). 8379 - Could be split into 8380 separate s_waitcnt 8381 vmcnt(0) and 8382 s_waitcnt 8383 lgkmcnt(0) to allow 8384 them to be 8385 independently moved 8386 according to the 8387 following rules. 
8388 - s_waitcnt vmcnt(0) 8389 must happen after 8390 any preceding 8391 global/generic 8392 load/store/load 8393 atomic/store 8394 atomic/atomicrmw. 8395 - s_waitcnt lgkmcnt(0) 8396 must happen after 8397 any preceding 8398 local/generic 8399 load/store/load 8400 atomic/store 8401 atomic/atomicrmw. 8402 - Must happen before 8403 the following buffer_invl2 and 8404 buffer_wbinvl1_vol. 8405 - Ensures that the 8406 preceding 8407 global/local/generic 8408 load 8409 atomic/atomicrmw 8410 with an equal or 8411 wider sync scope 8412 and memory ordering 8413 stronger than 8414 unordered (this is 8415 termed the 8416 acquire-fence-paired-atomic) 8417 has completed 8418 before invalidating 8419 the cache. This 8420 satisfies the 8421 requirements of 8422 acquire. 8423 - Ensures that all 8424 previous memory 8425 operations have 8426 completed before a 8427 following 8428 global/local/generic 8429 store 8430 atomic/atomicrmw 8431 with an equal or 8432 wider sync scope 8433 and memory ordering 8434 stronger than 8435 unordered (this is 8436 termed the 8437 release-fence-paired-atomic). 8438 This satisfies the 8439 requirements of 8440 release. 8441 8442 3. buffer_invl2; 8443 buffer_wbinvl1_vol 8444 8445 - Must happen before 8446 any following 8447 global/generic 8448 load/load 8449 atomic/store/store 8450 atomic/atomicrmw. 8451 - Ensures that 8452 following 8453 loads will not see 8454 stale L1 global data, 8455 nor see stale L2 MTYPE 8456 NC global data. 8457 MTYPE RW and CC memory will 8458 never be stale in L2 due to 8459 the memory probes. 8460 8461 **Sequential Consistent Atomic** 8462 ------------------------------------------------------------------------------------ 8463 load atomic seq_cst - singlethread - global *Same as corresponding 8464 - wavefront - local load atomic acquire, 8465 - generic except must generate 8466 all instructions even 8467 for OpenCL.* 8468 load atomic seq_cst - workgroup - global 1. 
s_waitcnt lgkm/vmcnt(0) 8469 - generic 8470 - Use lgkmcnt(0) if not 8471 TgSplit execution mode 8472 and vmcnt(0) if TgSplit 8473 execution mode. 8474 - s_waitcnt lgkmcnt(0) must 8475 happen after 8476 preceding 8477 local/generic load 8478 atomic/store 8479 atomic/atomicrmw 8480 with memory 8481 ordering of seq_cst 8482 and with equal or 8483 wider sync scope. 8484 (Note that seq_cst 8485 fences have their 8486 own s_waitcnt 8487 lgkmcnt(0) and so do 8488 not need to be 8489 considered.) 8490 - s_waitcnt vmcnt(0) 8491 must happen after 8492 preceding 8493 global/generic load 8494 atomic/store 8495 atomic/atomicrmw 8496 with memory 8497 ordering of seq_cst 8498 and with equal or 8499 wider sync scope. 8500 (Note that seq_cst 8501 fences have their 8502 own s_waitcnt 8503 vmcnt(0) and so do 8504 not need to be 8505 considered.) 8506 - Ensures any 8507 preceding 8508 sequential 8509 consistent global/local 8510 memory instructions 8511 have completed 8512 before executing 8513 this sequentially 8514 consistent 8515 instruction. This 8516 prevents reordering 8517 a seq_cst store 8518 followed by a 8519 seq_cst load. (Note 8520 that seq_cst is 8521 stronger than 8522 acquire/release as 8523 the reordering of 8524 load acquire 8525 followed by a store 8526 release is 8527 prevented by the 8528 s_waitcnt of 8529 the release, but 8530 there is nothing 8531 preventing a store 8532 release followed by 8533 load acquire from 8534 completing out of 8535 order. The s_waitcnt 8536 could be placed after 8537 seq_store or before 8538 the seq_load. We 8539 choose the load to 8540 make the s_waitcnt be 8541 as late as possible 8542 so that the store 8543 may have already 8544 completed.) 8545 8546 2. 
*Following 8547 instructions same as 8548 corresponding load 8549 atomic acquire, 8550 except must generate 8551 all instructions even 8552 for OpenCL.* 8553 load atomic seq_cst - workgroup - local *If TgSplit execution mode, 8554 local address space cannot 8555 be used.* 8556 8557 *Same as corresponding 8558 load atomic acquire, 8559 except must generate 8560 all instructions even 8561 for OpenCL.* 8562 8563 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 8564 - system - generic vmcnt(0) 8565 8566 - If TgSplit execution mode, 8567 omit lgkmcnt(0). 8568 - Could be split into 8569 separate s_waitcnt 8570 vmcnt(0) 8571 and s_waitcnt 8572 lgkmcnt(0) to allow 8573 them to be 8574 independently moved 8575 according to the 8576 following rules. 8577 - s_waitcnt lgkmcnt(0) 8578 must happen after 8579 preceding 8580 global/generic load 8581 atomic/store 8582 atomic/atomicrmw 8583 with memory 8584 ordering of seq_cst 8585 and with equal or 8586 wider sync scope. 8587 (Note that seq_cst 8588 fences have their 8589 own s_waitcnt 8590 lgkmcnt(0) and so do 8591 not need to be 8592 considered.) 8593 - s_waitcnt vmcnt(0) 8594 must happen after 8595 preceding 8596 global/generic load 8597 atomic/store 8598 atomic/atomicrmw 8599 with memory 8600 ordering of seq_cst 8601 and with equal or 8602 wider sync scope. 8603 (Note that seq_cst 8604 fences have their 8605 own s_waitcnt 8606 vmcnt(0) and so do 8607 not need to be 8608 considered.) 8609 - Ensures any 8610 preceding 8611 sequential 8612 consistent global 8613 memory instructions 8614 have completed 8615 before executing 8616 this sequentially 8617 consistent 8618 instruction. This 8619 prevents reordering 8620 a seq_cst store 8621 followed by a 8622 seq_cst load. 
                                                             (Note
                                                             that seq_cst is
                                                             stronger than
                                                             acquire/release as
                                                             the reordering of
                                                             load acquire
                                                             followed by a store
                                                             release is
                                                             prevented by the
                                                             s_waitcnt of
                                                             the release, but
                                                             there is nothing
                                                             preventing a store
                                                             release followed by
                                                             load acquire from
                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             seq_store or before
                                                             the seq_load. We
                                                             choose the load to
                                                             make the s_waitcnt be
                                                             as late as possible
                                                             so that the store
                                                             may have already
                                                             completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx940:

Memory Model GFX940
+++++++++++++++++++

For GFX940:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
  The exception is when in tgsplit execution mode
  when the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is when in tgsplit execution mode when no LDS
  is allocated as wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they access
  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.

  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode as wavefronts of the same work-group can be in
    different CUs and so a ``buffer_inv sc0`` is required which will invalidate
    the L1 cache.

  * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
    between wavefronts executing in different work-groups as they may be
    executing on different CUs.

  * Atomic read-modify-write instructions implicitly bypass the L1 cache.
    Therefore, they do not use the sc0 bit for coherence and instead use it to
    indicate if the instruction returns the original value being updated. They
    do use sc1 to indicate system or agent scope coherence.

* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache.

  * The gfx940 can be configured as a number of smaller agents with each having
    a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
    larger agents with groups of CUs on each agent each sharing separate L2
    caches.
  * The L2 cache has independent channels to service disjoint ranges of virtual
    addresses.
  * Each CU has a separate request queue per channel for its associated L2.
    Therefore, the vector and scalar memory operations performed by wavefronts
    executing with different L1 caches and the same L2 cache can be reordered
    relative to each other.
  * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
    vector memory operations of different CUs. It ensures a previous vector
    memory operation has completed before executing a subsequent vector memory
    or LDS operation and so can be used to meet the requirements of acquire and
    release.
  * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
    (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
    the PTE C-bit set for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
      the cache probe caused by the PTE C-bit.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or writeback
      the CPU cache due to the L2 probe filter.
    * To ensure coherence of local memory writes of CUs with different L1 caches
      in the same agent a ``buffer_wbl2`` is required. It does nothing if the
      agent is configured to have a single L2, or will writeback dirty L2 cache
      lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory writes of CUs in different agents a
      ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
    * To ensure coherence of local memory reads of CUs with different L1 caches
      in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
      agent is configured to have a single L2, or will invalidate non-local L2
      cache lines if configured to have multiple L2 caches.
    * To ensure coherence of local memory reads of CUs in different agents a
      ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
      lines if configured to have multiple L2 caches.

  * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
    UC (uncached) which bypasses the L2.
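
The cache-maintenance rules in the list above can be condensed into a lookup
from sync scope to instruction. The following is a minimal sketch in Python, not
code from the AMDGPU backend: the helper names ``acquire_invalidate`` and
``release_writeback`` and the ``tgsplit`` parameter are hypothetical, and the
instruction strings are copied from the acquire and release rows of
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.

.. code-block:: python

  # Illustrative sketch only: the helper names are hypothetical, not LLVM APIs.
  # The instruction strings are copied from the GFX940 code sequence table.

  SYNC_SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


  def _check(scope):
      if scope not in SYNC_SCOPES:
          raise ValueError(f"unknown sync scope: {scope!r}")


  def acquire_invalidate(scope, tgsplit=False):
      """Return the buffer_inv that follows an acquire at `scope`, or None."""
      _check(scope)
      if scope == "system":
          return "buffer_inv sc0=1 sc1=1"
      if scope == "agent":
          return "buffer_inv sc1=1"
      if scope == "workgroup" and tgsplit:
          # In tgsplit mode a work-group may span CUs, so L1 is invalidated.
          return "buffer_inv sc0=1"
      # Wavefronts on the same CU share one L1 cache: nothing to invalidate.
      return None


  def release_writeback(scope):
      """Return the buffer_wbl2 that precedes a release at `scope`, or None."""
      _check(scope)
      if scope == "system":
          return "buffer_wbl2 sc0=1 sc1=1"
      if scope == "agent":
          return "buffer_wbl2 sc1=1"
      # Work-group scope and narrower need no L2 writeback.
      return None

For example, ``acquire_invalidate("workgroup", tgsplit=True)`` selects
``buffer_inv sc0=1``, while the same call without tgsplit selects nothing
because the wavefronts of the work-group then share one L1 cache.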

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile).
Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX940 are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX940
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX940
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              nt=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              sc0=1 sc1=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              nt=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                              sc0=1 sc1=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     sc0=1
     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     sc1=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     sc0=1 sc1=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
                                              - generic     sc0=1
     store atomic monotonic    - agent        - global   1. buffer/global/flat_store
                                              - generic     sc1=1
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic     sc0=1 sc1=1
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic     sc1=1
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before the
                                                             following buffer_inv.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

     load atomic  acquire      - workgroup    - generic  1. flat_load sc0=1
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                                                            sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale global data.

     load atomic  acquire      - system       - global   1. buffer/global/flat_load
                                                            sc0=1 sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.

     load atomic  acquire      - agent        - generic  1. flat_load sc1=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the caches.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before the
                                                             following buffer_inv.
                                                           - Ensures the atomicrmw
                                                             has completed
                                                             before invalidating
                                                             the cache.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local
                                                             atomicrmw value
                                                             being acquired.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local
                                                             atomicrmw value
                                                             being acquired.

                                                         3. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
                                                            sc1=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             cache.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - system       - generic  1. flat_atomic sc1=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate. If
                                                             fence had an
                                                             address space then
                                                             set to address
                                                             space of OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the
                                                             value read by the
                                                             fence-paired-atomic.

                                                         2. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures that the
                                                             fence-paired-atomic
                                                             has completed
                                                             before invalidating
                                                             the cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             fence-paired-atomic.

                                                         2. buffer_inv sc1=1

                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures that the
                                                             fence-paired-atomic
                                                             has completed
                                                             before invalidating
                                                             the cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             fence-paired-atomic.

                                                         2. buffer_inv sc0=1 sc1=1

                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/store/
                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. buffer/global/flat_store sc0=1
     store atomic release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - agent        - global   1. buffer_wbl2 sc1=1
                                              - generic
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             memory operations
                                                             to memory have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         3. buffer/global/flat_store sc1=1
     store atomic release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
                                              - generic
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after any
                                                             preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after any
                                                             preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             memory operations
                                                             to memory and the L2
                                                             writeback have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         3. buffer/global/flat_store
                                                            sc0=1 sc1=1
     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/store/
                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global/flat_atomic sc0=1
     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - agent        - global   1. buffer_wbl2 sc1=1
                                              - generic
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global and local
                                                             have completed
                                                             before performing
                                                             the atomicrmw that
                                                             is being released.

                                                         3. buffer/global/flat_atomic sc1=1
     atomicrmw    release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
                                              - generic
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to memory and the L2
                                                             writeback have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         3. buffer/global/flat_atomic
                                                            sc0=1 sc1=1
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate. If
                                                             fence had an
                                                             address space then
                                                             set to address
                                                             space of OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/
                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     fence        release      - agent        *none*     1. buffer_wbl2 sc1=1

                                                           - If OpenCL and
                                                             address space is
                                                             local, omit.
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate. If
                                                             fence had an
                                                             address space then
                                                             set to address
                                                             space of OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate. If
                                                             fence had an
                                                             address space then
                                                             set to address
                                                             space of OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
10100 - s_waitcnt vmcnt(0) 10101 must happen after 10102 any preceding 10103 global/generic load/store/ 10104 load atomic/store atomic/ 10105 atomicrmw. 10106 - s_waitcnt lgkmcnt(0) 10107 must happen after 10108 any preceding 10109 local/generic 10110 load/store/load 10111 atomic/store 10112 atomic/atomicrmw. 10113 - Must happen before 10114 the following 10115 atomicrmw. 10116 - Ensures that all 10117 memory operations 10118 have 10119 completed before 10120 performing the 10121 atomicrmw that is 10122 being released. 10123 10124 2. buffer/global_atomic 10125 3. s_waitcnt vmcnt(0) 10126 10127 - If not TgSplit execution 10128 mode, omit. 10129 - Must happen before 10130 the following 10131 buffer_inv. 10132 - Ensures any 10133 following global 10134 data read is no 10135 older than the 10136 atomicrmw value 10137 being acquired. 10138 10139 4. buffer_inv sc0=1 10140 10141 - If not TgSplit execution 10142 mode, omit. 10143 - Ensures that 10144 following 10145 loads will not see 10146 stale data. 10147 10148 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode, 10149 local address space cannot 10150 be used.* 10151 10152 1. ds_atomic 10153 2. s_waitcnt lgkmcnt(0) 10154 10155 - If OpenCL, omit. 10156 - Must happen before 10157 any following 10158 global/generic 10159 load/load 10160 atomic/store/store 10161 atomic/atomicrmw. 10162 - Ensures any 10163 following global 10164 data read is no 10165 older than the local load 10166 atomic value being 10167 acquired. 10168 10169 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0) 10170 10171 - Use lgkmcnt(0) if not 10172 TgSplit execution mode 10173 and vmcnt(0) if TgSplit 10174 execution mode. 10175 - If OpenCL, omit 10176 lgkmcnt(0). 10177 - s_waitcnt vmcnt(0) 10178 must happen after 10179 any preceding 10180 global/generic load/store/ 10181 load atomic/store atomic/ 10182 atomicrmw. 
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_inv and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         4. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
10273 - Must happen before 10274 the following 10275 atomicrmw. 10276 - Ensures that all 10277 memory operations 10278 to global have 10279 completed before 10280 performing the 10281 atomicrmw that is 10282 being released. 10283 10284 3. buffer/global_atomic 10285 4. s_waitcnt vmcnt(0) 10286 10287 - Must happen before 10288 following 10289 buffer_inv. 10290 - Ensures the 10291 atomicrmw has 10292 completed before 10293 invalidating the 10294 cache. 10295 10296 5. buffer_inv sc1=1 10297 10298 - Must happen before 10299 any following 10300 global/generic 10301 load/load 10302 atomic/atomicrmw. 10303 - Ensures that 10304 following loads 10305 will not see stale 10306 global data. 10307 10308 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1 10309 10310 - Must happen before 10311 following s_waitcnt. 10312 - Performs L2 writeback to 10313 ensure previous 10314 global/generic 10315 store/atomicrmw are 10316 visible at system scope. 10317 10318 2. s_waitcnt lgkmcnt(0) & 10319 vmcnt(0) 10320 10321 - If TgSplit execution mode, 10322 omit lgkmcnt(0). 10323 - If OpenCL, omit 10324 lgkmcnt(0). 10325 - Could be split into 10326 separate s_waitcnt 10327 vmcnt(0) and 10328 s_waitcnt 10329 lgkmcnt(0) to allow 10330 them to be 10331 independently moved 10332 according to the 10333 following rules. 10334 - s_waitcnt vmcnt(0) 10335 must happen after 10336 any preceding 10337 global/generic 10338 load/store/load 10339 atomic/store 10340 atomic/atomicrmw. 10341 - s_waitcnt lgkmcnt(0) 10342 must happen after 10343 any preceding 10344 local/generic 10345 load/store/load 10346 atomic/store 10347 atomic/atomicrmw. 10348 - Must happen before 10349 the following 10350 atomicrmw. 10351 - Ensures that all 10352 memory operations 10353 to global and L2 writeback 10354 have completed before 10355 performing the 10356 atomicrmw that is 10357 being released. 10358 10359 3. buffer/global_atomic 10360 sc1=1 10361 4. 
s_waitcnt vmcnt(0) 10362 10363 - Must happen before 10364 following 10365 buffer_inv. 10366 - Ensures the 10367 atomicrmw has 10368 completed before 10369 invalidating the 10370 caches. 10371 10372 5. buffer_inv sc0=1 sc1=1 10373 10374 - Must happen before 10375 any following 10376 global/generic 10377 load/load 10378 atomic/atomicrmw. 10379 - Ensures that 10380 following loads 10381 will not see stale 10382 MTYPE NC global data. 10383 MTYPE RW and CC memory will 10384 never be stale due to the 10385 memory probes. 10386 10387 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1 10388 10389 - Must happen before 10390 following s_waitcnt. 10391 - Performs L2 writeback to 10392 ensure previous 10393 global/generic 10394 store/atomicrmw are 10395 visible at agent scope. 10396 10397 2. s_waitcnt lgkmcnt(0) & 10398 vmcnt(0) 10399 10400 - If TgSplit execution mode, 10401 omit lgkmcnt(0). 10402 - If OpenCL, omit 10403 lgkmcnt(0). 10404 - Could be split into 10405 separate s_waitcnt 10406 vmcnt(0) and 10407 s_waitcnt 10408 lgkmcnt(0) to allow 10409 them to be 10410 independently moved 10411 according to the 10412 following rules. 10413 - s_waitcnt vmcnt(0) 10414 must happen after 10415 any preceding 10416 global/generic 10417 load/store/load 10418 atomic/store 10419 atomic/atomicrmw. 10420 - s_waitcnt lgkmcnt(0) 10421 must happen after 10422 any preceding 10423 local/generic 10424 load/store/load 10425 atomic/store 10426 atomic/atomicrmw. 10427 - Must happen before 10428 the following 10429 atomicrmw. 10430 - Ensures that all 10431 memory operations 10432 to global have 10433 completed before 10434 performing the 10435 atomicrmw that is 10436 being released. 10437 10438 3. flat_atomic 10439 4. s_waitcnt vmcnt(0) & 10440 lgkmcnt(0) 10441 10442 - If TgSplit execution mode, 10443 omit lgkmcnt(0). 10444 - If OpenCL, omit 10445 lgkmcnt(0). 10446 - Must happen before 10447 following 10448 buffer_inv. 
10449 - Ensures the 10450 atomicrmw has 10451 completed before 10452 invalidating the 10453 cache. 10454 10455 5. buffer_inv sc1=1 10456 10457 - Must happen before 10458 any following 10459 global/generic 10460 load/load 10461 atomic/atomicrmw. 10462 - Ensures that 10463 following loads 10464 will not see stale 10465 global data. 10466 10467 atomicrmw acq_rel - system - generic 1. buffer_wbl2 sc0=1 sc1=1 10468 10469 - Must happen before 10470 following s_waitcnt. 10471 - Performs L2 writeback to 10472 ensure previous 10473 global/generic 10474 store/atomicrmw are 10475 visible at system scope. 10476 10477 2. s_waitcnt lgkmcnt(0) & 10478 vmcnt(0) 10479 10480 - If TgSplit execution mode, 10481 omit lgkmcnt(0). 10482 - If OpenCL, omit 10483 lgkmcnt(0). 10484 - Could be split into 10485 separate s_waitcnt 10486 vmcnt(0) and 10487 s_waitcnt 10488 lgkmcnt(0) to allow 10489 them to be 10490 independently moved 10491 according to the 10492 following rules. 10493 - s_waitcnt vmcnt(0) 10494 must happen after 10495 any preceding 10496 global/generic 10497 load/store/load 10498 atomic/store 10499 atomic/atomicrmw. 10500 - s_waitcnt lgkmcnt(0) 10501 must happen after 10502 any preceding 10503 local/generic 10504 load/store/load 10505 atomic/store 10506 atomic/atomicrmw. 10507 - Must happen before 10508 the following 10509 atomicrmw. 10510 - Ensures that all 10511 memory operations 10512 to global and L2 writeback 10513 have completed before 10514 performing the 10515 atomicrmw that is 10516 being released. 10517 10518 3. flat_atomic sc1=1 10519 4. s_waitcnt vmcnt(0) & 10520 lgkmcnt(0) 10521 10522 - If TgSplit execution mode, 10523 omit lgkmcnt(0). 10524 - If OpenCL, omit 10525 lgkmcnt(0). 10526 - Must happen before 10527 following 10528 buffer_inv. 10529 - Ensures the 10530 atomicrmw has 10531 completed before 10532 invalidating the 10533 caches. 10534 10535 5. 
buffer_inv sc0=1 sc1=1 10536 10537 - Must happen before 10538 any following 10539 global/generic 10540 load/load 10541 atomic/atomicrmw. 10542 - Ensures that 10543 following loads 10544 will not see stale 10545 MTYPE NC global data. 10546 MTYPE RW and CC memory will 10547 never be stale due to the 10548 memory probes. 10549 10550 fence acq_rel - singlethread *none* *none* 10551 - wavefront 10552 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 10553 10554 - Use lgkmcnt(0) if not 10555 TgSplit execution mode 10556 and vmcnt(0) if TgSplit 10557 execution mode. 10558 - If OpenCL and 10559 address space is 10560 not generic, omit 10561 lgkmcnt(0). 10562 - If OpenCL and 10563 address space is 10564 local, omit 10565 vmcnt(0). 10566 - However, 10567 since LLVM 10568 currently has no 10569 address space on 10570 the fence need to 10571 conservatively 10572 always generate 10573 (see comment for 10574 previous fence). 10575 - s_waitcnt vmcnt(0) 10576 must happen after 10577 any preceding 10578 global/generic 10579 load/store/ 10580 load atomic/store atomic/ 10581 atomicrmw. 10582 - s_waitcnt lgkmcnt(0) 10583 must happen after 10584 any preceding 10585 local/generic 10586 load/load 10587 atomic/store/store 10588 atomic/atomicrmw. 10589 - Must happen before 10590 any following 10591 global/generic 10592 load/load 10593 atomic/store/store 10594 atomic/atomicrmw. 10595 - Ensures that all 10596 memory operations 10597 have 10598 completed before 10599 performing any 10600 following global 10601 memory operations. 10602 - Ensures that the 10603 preceding 10604 local/generic load 10605 atomic/atomicrmw 10606 with an equal or 10607 wider sync scope 10608 and memory ordering 10609 stronger than 10610 unordered (this is 10611 termed the 10612 acquire-fence-paired-atomic) 10613 has completed 10614 before following 10615 global memory 10616 operations. This 10617 satisfies the 10618 requirements of 10619 acquire. 
                                                           - Ensures that all
                                                             previous memory
                                                             operations have
                                                             completed before a
                                                             following
                                                             local/generic store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of
                                                             release.
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures that the
                                                             acquire-fence-paired-atomic
                                                             has completed
                                                             before invalidating
                                                             the
                                                             cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             acquire-fence-paired-atomic.

                                                         2. buffer_inv sc0=1

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1

                                                           - If OpenCL and
                                                             address space is
                                                             local, omit.
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at agent scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures that the
                                                             preceding
                                                             global/local/generic
                                                             load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed
                                                             before invalidating
                                                             the cache. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have
                                                             completed before a
                                                             following
                                                             global/local/generic
                                                             store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of
                                                             release.

                                                         3. buffer_inv sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.

     fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1

                                                           - If OpenCL and
                                                             address space is
                                                             local, omit.
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence need to
                                                             conservatively
                                                             always generate
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             buffer_inv.
                                                           - Ensures that the
                                                             preceding
                                                             global/local/generic
                                                             load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed
                                                             before invalidating
                                                             the cache. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have
                                                             completed before a
                                                             following
                                                             global/local/generic
                                                             store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of
                                                             release.

                                                         3. buffer_inv sc0=1 sc1=1

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             MTYPE NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale due to the
                                                             memory probes.
10879 10880 **Sequential Consistent Atomic** 10881 ------------------------------------------------------------------------------------ 10882 load atomic seq_cst - singlethread - global *Same as corresponding 10883 - wavefront - local load atomic acquire, 10884 - generic except must generate 10885 all instructions even 10886 for OpenCL.* 10887 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 10888 - generic 10889 - Use lgkmcnt(0) if not 10890 TgSplit execution mode 10891 and vmcnt(0) if TgSplit 10892 execution mode. 10893 - s_waitcnt lgkmcnt(0) must 10894 happen after 10895 preceding 10896 local/generic load 10897 atomic/store 10898 atomic/atomicrmw 10899 with memory 10900 ordering of seq_cst 10901 and with equal or 10902 wider sync scope. 10903 (Note that seq_cst 10904 fences have their 10905 own s_waitcnt 10906 lgkmcnt(0) and so do 10907 not need to be 10908 considered.) 10909 - s_waitcnt vmcnt(0) 10910 must happen after 10911 preceding 10912 global/generic load 10913 atomic/store 10914 atomic/atomicrmw 10915 with memory 10916 ordering of seq_cst 10917 and with equal or 10918 wider sync scope. 10919 (Note that seq_cst 10920 fences have their 10921 own s_waitcnt 10922 vmcnt(0) and so do 10923 not need to be 10924 considered.) 10925 - Ensures any 10926 preceding 10927 sequential 10928 consistent global/local 10929 memory instructions 10930 have completed 10931 before executing 10932 this sequentially 10933 consistent 10934 instruction. This 10935 prevents reordering 10936 a seq_cst store 10937 followed by a 10938 seq_cst load. (Note 10939 that seq_cst is 10940 stronger than 10941 acquire/release as 10942 the reordering of 10943 load acquire 10944 followed by a store 10945 release is 10946 prevented by the 10947 s_waitcnt of 10948 the release, but 10949 there is nothing 10950 preventing a store 10951 release followed by 10952 load acquire from 10953 completing out of 10954 order. 
The s_waitcnt 10955 could be placed after 10956 seq_store or before 10957 the seq_load. We 10958 choose the load to 10959 make the s_waitcnt be 10960 as late as possible 10961 so that the store 10962 may have already 10963 completed.) 10964 10965 2. *Following 10966 instructions same as 10967 corresponding load 10968 atomic acquire, 10969 except must generate 10970 all instructions even 10971 for OpenCL.* 10972 load atomic seq_cst - workgroup - local *If TgSplit execution mode, 10973 local address space cannot 10974 be used.* 10975 10976 *Same as corresponding 10977 load atomic acquire, 10978 except must generate 10979 all instructions even 10980 for OpenCL.* 10981 10982 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 10983 - system - generic vmcnt(0) 10984 10985 - If TgSplit execution mode, 10986 omit lgkmcnt(0). 10987 - Could be split into 10988 separate s_waitcnt 10989 vmcnt(0) 10990 and s_waitcnt 10991 lgkmcnt(0) to allow 10992 them to be 10993 independently moved 10994 according to the 10995 following rules. 10996 - s_waitcnt lgkmcnt(0) 10997 must happen after 10998 preceding 10999 global/generic load 11000 atomic/store 11001 atomic/atomicrmw 11002 with memory 11003 ordering of seq_cst 11004 and with equal or 11005 wider sync scope. 11006 (Note that seq_cst 11007 fences have their 11008 own s_waitcnt 11009 lgkmcnt(0) and so do 11010 not need to be 11011 considered.) 11012 - s_waitcnt vmcnt(0) 11013 must happen after 11014 preceding 11015 global/generic load 11016 atomic/store 11017 atomic/atomicrmw 11018 with memory 11019 ordering of seq_cst 11020 and with equal or 11021 wider sync scope. 11022 (Note that seq_cst 11023 fences have their 11024 own s_waitcnt 11025 vmcnt(0) and so do 11026 not need to be 11027 considered.) 11028 - Ensures any 11029 preceding 11030 sequential 11031 consistent global 11032 memory instructions 11033 have completed 11034 before executing 11035 this sequentially 11036 consistent 11037 instruction. 
                                                             This
                                                             prevents reordering
                                                             a seq_cst store
                                                             followed by a
                                                             seq_cst load. (Note
                                                             that seq_cst is
                                                             stronger than
                                                             acquire/release as
                                                             the reordering of
                                                             load acquire
                                                             followed by a store
                                                             release is
                                                             prevented by the
                                                             s_waitcnt of
                                                             the release, but
                                                             there is nothing
                                                             preventing a store
                                                             release followed by
                                                             load acquire from
                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             seq_store or before
                                                             the seq_load. We
                                                             choose the load to
                                                             make the s_waitcnt be
                                                             as late as possible
                                                             so that the store
                                                             may have already
                                                             completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10:

Memory Model GFX10
++++++++++++++++++

For GFX10:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront-wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. An
  ``s_waitcnt lgkmcnt(0)`` is required to ensure synchronization between LDS
  operations and vector memory operations between wavefronts of a work-group,
  but not between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront-wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s.
  A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way and so do not affect the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. An ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel.
  Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. An ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 a memory-attached last-level (MALL) cache exists for GPU memory.
  The MALL cache is fully coherent with GPU memory and has no impact on system
  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time.
If scalar writes are used
then an ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for an
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and vscnt reports completion of store
and atomic without return instructions in order. See the ``MEM_ORDERED`` field
in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the
  per-CU L0 caches is required for work-group synchronization.
  Also, accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0 which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.

The code sequences used to implement the memory model for GFX10 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX10
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX10
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1 dlc=1
                                                           2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                           2. s_waitcnt vscnt(0)

                                                            - Must happen before
                                                              any following volatile
                                                              global/generic
                                                              load/store.
                                                            - Ensures that
                                                              volatile
                                                              operations to
                                                              different
                                                              addresses will not
                                                              be reordered by
                                                              hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1 dlc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1.
ds_atomic 11327 - wavefront 11328 - workgroup 11329 **Acquire Atomic** 11330 ------------------------------------------------------------------------------------ 11331 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 11332 - wavefront - local 11333 - generic 11334 load atomic acquire - workgroup - global 1. buffer/global_load glc=1 11335 11336 - If CU wavefront execution 11337 mode, omit glc=1. 11338 11339 2. s_waitcnt vmcnt(0) 11340 11341 - If CU wavefront execution 11342 mode, omit. 11343 - Must happen before 11344 the following buffer_gl0_inv 11345 and before any following 11346 global/generic 11347 load/load 11348 atomic/store/store 11349 atomic/atomicrmw. 11350 11351 3. buffer_gl0_inv 11352 11353 - If CU wavefront execution 11354 mode, omit. 11355 - Ensures that 11356 following 11357 loads will not see 11358 stale data. 11359 11360 load atomic acquire - workgroup - local 1. ds_load 11361 2. s_waitcnt lgkmcnt(0) 11362 11363 - If OpenCL, omit. 11364 - Must happen before 11365 the following buffer_gl0_inv 11366 and before any following 11367 global/generic load/load 11368 atomic/store/store 11369 atomic/atomicrmw. 11370 - Ensures any 11371 following global 11372 data read is no 11373 older than the local load 11374 atomic value being 11375 acquired. 11376 11377 3. buffer_gl0_inv 11378 11379 - If CU wavefront execution 11380 mode, omit. 11381 - If OpenCL, omit. 11382 - Ensures that 11383 following 11384 loads will not see 11385 stale data. 11386 11387 load atomic acquire - workgroup - generic 1. flat_load glc=1 11388 11389 - If CU wavefront execution 11390 mode, omit glc=1. 11391 11392 2. s_waitcnt lgkmcnt(0) & 11393 vmcnt(0) 11394 11395 - If CU wavefront execution 11396 mode, omit vmcnt(0). 11397 - If OpenCL, omit 11398 lgkmcnt(0). 11399 - Must happen before 11400 the following 11401 buffer_gl0_inv and any 11402 following global/generic 11403 load/load 11404 atomic/store/store 11405 atomic/atomicrmw. 
11406 - Ensures any 11407 following global 11408 data read is no 11409 older than a local load 11410 atomic value being 11411 acquired. 11412 11413 3. buffer_gl0_inv 11414 11415 - If CU wavefront execution 11416 mode, omit. 11417 - Ensures that 11418 following 11419 loads will not see 11420 stale data. 11421 11422 load atomic acquire - agent - global 1. buffer/global_load 11423 - system glc=1 dlc=1 11424 2. s_waitcnt vmcnt(0) 11425 11426 - Must happen before 11427 following 11428 buffer_gl*_inv. 11429 - Ensures the load 11430 has completed 11431 before invalidating 11432 the caches. 11433 11434 3. buffer_gl0_inv; 11435 buffer_gl1_inv 11436 11437 - Must happen before 11438 any following 11439 global/generic 11440 load/load 11441 atomic/atomicrmw. 11442 - Ensures that 11443 following 11444 loads will not see 11445 stale global data. 11446 11447 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1 11448 - system 2. s_waitcnt vmcnt(0) & 11449 lgkmcnt(0) 11450 11451 - If OpenCL omit 11452 lgkmcnt(0). 11453 - Must happen before 11454 following 11455 buffer_gl*_invl. 11456 - Ensures the flat_load 11457 has completed 11458 before invalidating 11459 the caches. 11460 11461 3. buffer_gl0_inv; 11462 buffer_gl1_inv 11463 11464 - Must happen before 11465 any following 11466 global/generic 11467 load/load 11468 atomic/atomicrmw. 11469 - Ensures that 11470 following loads 11471 will not see stale 11472 global data. 11473 11474 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 11475 - wavefront - local 11476 - generic 11477 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 11478 2. s_waitcnt vm/vscnt(0) 11479 11480 - If CU wavefront execution 11481 mode, omit. 11482 - Use vmcnt(0) if atomic with 11483 return and vscnt(0) if 11484 atomic with no-return. 11485 - Must happen before 11486 the following buffer_gl0_inv 11487 and before any following 11488 global/generic 11489 load/load 11490 atomic/store/store 11491 atomic/atomicrmw. 
11492 11493 3. buffer_gl0_inv 11494 11495 - If CU wavefront execution 11496 mode, omit. 11497 - Ensures that 11498 following 11499 loads will not see 11500 stale data. 11501 11502 atomicrmw acquire - workgroup - local 1. ds_atomic 11503 2. s_waitcnt lgkmcnt(0) 11504 11505 - If OpenCL, omit. 11506 - Must happen before 11507 the following 11508 buffer_gl0_inv. 11509 - Ensures any 11510 following global 11511 data read is no 11512 older than the local 11513 atomicrmw value 11514 being acquired. 11515 11516 3. buffer_gl0_inv 11517 11518 - If OpenCL omit. 11519 - Ensures that 11520 following 11521 loads will not see 11522 stale data. 11523 11524 atomicrmw acquire - workgroup - generic 1. flat_atomic 11525 2. s_waitcnt lgkmcnt(0) & 11526 vm/vscnt(0) 11527 11528 - If CU wavefront execution 11529 mode, omit vm/vscnt(0). 11530 - If OpenCL, omit lgkmcnt(0). 11531 - Use vmcnt(0) if atomic with 11532 return and vscnt(0) if 11533 atomic with no-return. 11534 - Must happen before 11535 the following 11536 buffer_gl0_inv. 11537 - Ensures any 11538 following global 11539 data read is no 11540 older than a local 11541 atomicrmw value 11542 being acquired. 11543 11544 3. buffer_gl0_inv 11545 11546 - If CU wavefront execution 11547 mode, omit. 11548 - Ensures that 11549 following 11550 loads will not see 11551 stale data. 11552 11553 atomicrmw acquire - agent - global 1. buffer/global_atomic 11554 - system 2. s_waitcnt vm/vscnt(0) 11555 11556 - Use vmcnt(0) if atomic with 11557 return and vscnt(0) if 11558 atomic with no-return. 11559 - Must happen before 11560 following 11561 buffer_gl*_inv. 11562 - Ensures the 11563 atomicrmw has 11564 completed before 11565 invalidating the 11566 caches. 11567 11568 3. buffer_gl0_inv; 11569 buffer_gl1_inv 11570 11571 - Must happen before 11572 any following 11573 global/generic 11574 load/load 11575 atomic/atomicrmw. 11576 - Ensures that 11577 following loads 11578 will not see stale 11579 global data. 
11580 11581 atomicrmw acquire - agent - generic 1. flat_atomic 11582 - system 2. s_waitcnt vm/vscnt(0) & 11583 lgkmcnt(0) 11584 11585 - If OpenCL, omit 11586 lgkmcnt(0). 11587 - Use vmcnt(0) if atomic with 11588 return and vscnt(0) if 11589 atomic with no-return. 11590 - Must happen before 11591 following 11592 buffer_gl*_inv. 11593 - Ensures the 11594 atomicrmw has 11595 completed before 11596 invalidating the 11597 caches. 11598 11599 3. buffer_gl0_inv; 11600 buffer_gl1_inv 11601 11602 - Must happen before 11603 any following 11604 global/generic 11605 load/load 11606 atomic/atomicrmw. 11607 - Ensures that 11608 following loads 11609 will not see stale 11610 global data. 11611 11612 fence acquire - singlethread *none* *none* 11613 - wavefront 11614 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 11615 vmcnt(0) & vscnt(0) 11616 11617 - If CU wavefront execution 11618 mode, omit vmcnt(0) and 11619 vscnt(0). 11620 - If OpenCL and 11621 address space is 11622 not generic, omit 11623 lgkmcnt(0). 11624 - If OpenCL and 11625 address space is 11626 local, omit 11627 vmcnt(0) and vscnt(0). 11628 - However, since LLVM 11629 currently has no 11630 address space on 11631 the fence need to 11632 conservatively 11633 always generate. If 11634 fence had an 11635 address space then 11636 set to address 11637 space of OpenCL 11638 fence flag, or to 11639 generic if both 11640 local and global 11641 flags are 11642 specified. 11643 - Could be split into 11644 separate s_waitcnt 11645 vmcnt(0), s_waitcnt 11646 vscnt(0) and s_waitcnt 11647 lgkmcnt(0) to allow 11648 them to be 11649 independently moved 11650 according to the 11651 following rules. 11652 - s_waitcnt vmcnt(0) 11653 must happen after 11654 any preceding 11655 global/generic load 11656 atomic/ 11657 atomicrmw-with-return-value 11658 with an equal or 11659 wider sync scope 11660 and memory ordering 11661 stronger than 11662 unordered (this is 11663 termed the 11664 fence-paired-atomic). 
11665 - s_waitcnt vscnt(0) 11666 must happen after 11667 any preceding 11668 global/generic 11669 atomicrmw-no-return-value 11670 with an equal or 11671 wider sync scope 11672 and memory ordering 11673 stronger than 11674 unordered (this is 11675 termed the 11676 fence-paired-atomic). 11677 - s_waitcnt lgkmcnt(0) 11678 must happen after 11679 any preceding 11680 local/generic load 11681 atomic/atomicrmw 11682 with an equal or 11683 wider sync scope 11684 and memory ordering 11685 stronger than 11686 unordered (this is 11687 termed the 11688 fence-paired-atomic). 11689 - Must happen before 11690 the following 11691 buffer_gl0_inv. 11692 - Ensures that the 11693 fence-paired atomic 11694 has completed 11695 before invalidating 11696 the 11697 cache. Therefore 11698 any following 11699 locations read must 11700 be no older than 11701 the value read by 11702 the 11703 fence-paired-atomic. 11704 11705 3. buffer_gl0_inv 11706 11707 - If CU wavefront execution 11708 mode, omit. 11709 - Ensures that 11710 following 11711 loads will not see 11712 stale data. 11713 11714 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 11715 - system vmcnt(0) & vscnt(0) 11716 11717 - If OpenCL and 11718 address space is 11719 not generic, omit 11720 lgkmcnt(0). 11721 - If OpenCL and 11722 address space is 11723 local, omit 11724 vmcnt(0) and vscnt(0). 11725 - However, since LLVM 11726 currently has no 11727 address space on 11728 the fence need to 11729 conservatively 11730 always generate 11731 (see comment for 11732 previous fence). 11733 - Could be split into 11734 separate s_waitcnt 11735 vmcnt(0), s_waitcnt 11736 vscnt(0) and s_waitcnt 11737 lgkmcnt(0) to allow 11738 them to be 11739 independently moved 11740 according to the 11741 following rules. 
11742 - s_waitcnt vmcnt(0) 11743 must happen after 11744 any preceding 11745 global/generic load 11746 atomic/ 11747 atomicrmw-with-return-value 11748 with an equal or 11749 wider sync scope 11750 and memory ordering 11751 stronger than 11752 unordered (this is 11753 termed the 11754 fence-paired-atomic). 11755 - s_waitcnt vscnt(0) 11756 must happen after 11757 any preceding 11758 global/generic 11759 atomicrmw-no-return-value 11760 with an equal or 11761 wider sync scope 11762 and memory ordering 11763 stronger than 11764 unordered (this is 11765 termed the 11766 fence-paired-atomic). 11767 - s_waitcnt lgkmcnt(0) 11768 must happen after 11769 any preceding 11770 local/generic load 11771 atomic/atomicrmw 11772 with an equal or 11773 wider sync scope 11774 and memory ordering 11775 stronger than 11776 unordered (this is 11777 termed the 11778 fence-paired-atomic). 11779 - Must happen before 11780 the following 11781 buffer_gl*_inv. 11782 - Ensures that the 11783 fence-paired atomic 11784 has completed 11785 before invalidating 11786 the 11787 caches. Therefore 11788 any following 11789 locations read must 11790 be no older than 11791 the value read by 11792 the 11793 fence-paired-atomic. 11794 11795 2. buffer_gl0_inv; 11796 buffer_gl1_inv 11797 11798 - Must happen before any 11799 following global/generic 11800 load/load 11801 atomic/store/store 11802 atomic/atomicrmw. 11803 - Ensures that 11804 following loads 11805 will not see stale 11806 global data. 11807 11808 **Release Atomic** 11809 ------------------------------------------------------------------------------------ 11810 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 11811 - wavefront - local 11812 - generic 11813 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 11814 - generic vmcnt(0) & vscnt(0) 11815 11816 - If CU wavefront execution 11817 mode, omit vmcnt(0) and 11818 vscnt(0). 11819 - If OpenCL, omit 11820 lgkmcnt(0). 
11821 - Could be split into 11822 separate s_waitcnt 11823 vmcnt(0), s_waitcnt 11824 vscnt(0) and s_waitcnt 11825 lgkmcnt(0) to allow 11826 them to be 11827 independently moved 11828 according to the 11829 following rules. 11830 - s_waitcnt vmcnt(0) 11831 must happen after 11832 any preceding 11833 global/generic load/load 11834 atomic/ 11835 atomicrmw-with-return-value. 11836 - s_waitcnt vscnt(0) 11837 must happen after 11838 any preceding 11839 global/generic 11840 store/store 11841 atomic/ 11842 atomicrmw-no-return-value. 11843 - s_waitcnt lgkmcnt(0) 11844 must happen after 11845 any preceding 11846 local/generic 11847 load/store/load 11848 atomic/store 11849 atomic/atomicrmw. 11850 - Must happen before 11851 the following 11852 store. 11853 - Ensures that all 11854 memory operations 11855 have 11856 completed before 11857 performing the 11858 store that is being 11859 released. 11860 11861 2. buffer/global/flat_store 11862 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 11863 11864 - If CU wavefront execution 11865 mode, omit. 11866 - If OpenCL, omit. 11867 - Could be split into 11868 separate s_waitcnt 11869 vmcnt(0) and s_waitcnt 11870 vscnt(0) to allow 11871 them to be 11872 independently moved 11873 according to the 11874 following rules. 11875 - s_waitcnt vmcnt(0) 11876 must happen after 11877 any preceding 11878 global/generic load/load 11879 atomic/ 11880 atomicrmw-with-return-value. 11881 - s_waitcnt vscnt(0) 11882 must happen after 11883 any preceding 11884 global/generic 11885 store/store atomic/ 11886 atomicrmw-no-return-value. 11887 - Must happen before 11888 the following 11889 store. 11890 - Ensures that all 11891 global memory 11892 operations have 11893 completed before 11894 performing the 11895 store that is being 11896 released. 11897 11898 2. ds_store 11899 store atomic release - agent - global 1. 
s_waitcnt lgkmcnt(0) & 11900 - system - generic vmcnt(0) & vscnt(0) 11901 11902 - If OpenCL and 11903 address space is 11904 not generic, omit 11905 lgkmcnt(0). 11906 - Could be split into 11907 separate s_waitcnt 11908 vmcnt(0), s_waitcnt vscnt(0) 11909 and s_waitcnt 11910 lgkmcnt(0) to allow 11911 them to be 11912 independently moved 11913 according to the 11914 following rules. 11915 - s_waitcnt vmcnt(0) 11916 must happen after 11917 any preceding 11918 global/generic 11919 load/load 11920 atomic/ 11921 atomicrmw-with-return-value. 11922 - s_waitcnt vscnt(0) 11923 must happen after 11924 any preceding 11925 global/generic 11926 store/store atomic/ 11927 atomicrmw-no-return-value. 11928 - s_waitcnt lgkmcnt(0) 11929 must happen after 11930 any preceding 11931 local/generic 11932 load/store/load 11933 atomic/store 11934 atomic/atomicrmw. 11935 - Must happen before 11936 the following 11937 store. 11938 - Ensures that all 11939 memory operations 11940 have 11941 completed before 11942 performing the 11943 store that is being 11944 released. 11945 11946 2. buffer/global/flat_store 11947 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 11948 - wavefront - local 11949 - generic 11950 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 11951 - generic vmcnt(0) & vscnt(0) 11952 11953 - If CU wavefront execution 11954 mode, omit vmcnt(0) and 11955 vscnt(0). 11956 - If OpenCL, omit lgkmcnt(0). 11957 - Could be split into 11958 separate s_waitcnt 11959 vmcnt(0), s_waitcnt 11960 vscnt(0) and s_waitcnt 11961 lgkmcnt(0) to allow 11962 them to be 11963 independently moved 11964 according to the 11965 following rules. 11966 - s_waitcnt vmcnt(0) 11967 must happen after 11968 any preceding 11969 global/generic load/load 11970 atomic/ 11971 atomicrmw-with-return-value. 11972 - s_waitcnt vscnt(0) 11973 must happen after 11974 any preceding 11975 global/generic 11976 store/store 11977 atomic/ 11978 atomicrmw-no-return-value. 
11979 - s_waitcnt lgkmcnt(0) 11980 must happen after 11981 any preceding 11982 local/generic 11983 load/store/load 11984 atomic/store 11985 atomic/atomicrmw. 11986 - Must happen before 11987 the following 11988 atomicrmw. 11989 - Ensures that all 11990 memory operations 11991 have 11992 completed before 11993 performing the 11994 atomicrmw that is 11995 being released. 11996 11997 2. buffer/global/flat_atomic 11998 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 11999 12000 - If CU wavefront execution 12001 mode, omit. 12002 - If OpenCL, omit. 12003 - Could be split into 12004 separate s_waitcnt 12005 vmcnt(0) and s_waitcnt 12006 vscnt(0) to allow 12007 them to be 12008 independently moved 12009 according to the 12010 following rules. 12011 - s_waitcnt vmcnt(0) 12012 must happen after 12013 any preceding 12014 global/generic load/load 12015 atomic/ 12016 atomicrmw-with-return-value. 12017 - s_waitcnt vscnt(0) 12018 must happen after 12019 any preceding 12020 global/generic 12021 store/store atomic/ 12022 atomicrmw-no-return-value. 12023 - Must happen before 12024 the following 12025 store. 12026 - Ensures that all 12027 global memory 12028 operations have 12029 completed before 12030 performing the 12031 store that is being 12032 released. 12033 12034 2. ds_atomic 12035 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 12036 - system - generic vmcnt(0) & vscnt(0) 12037 12038 - If OpenCL, omit 12039 lgkmcnt(0). 12040 - Could be split into 12041 separate s_waitcnt 12042 vmcnt(0), s_waitcnt 12043 vscnt(0) and s_waitcnt 12044 lgkmcnt(0) to allow 12045 them to be 12046 independently moved 12047 according to the 12048 following rules. 12049 - s_waitcnt vmcnt(0) 12050 must happen after 12051 any preceding 12052 global/generic 12053 load/load atomic/ 12054 atomicrmw-with-return-value. 12055 - s_waitcnt vscnt(0) 12056 must happen after 12057 any preceding 12058 global/generic 12059 store/store atomic/ 12060 atomicrmw-no-return-value. 
12061 - s_waitcnt lgkmcnt(0) 12062 must happen after 12063 any preceding 12064 local/generic 12065 load/store/load 12066 atomic/store 12067 atomic/atomicrmw. 12068 - Must happen before 12069 the following 12070 atomicrmw. 12071 - Ensures that all 12072 memory operations 12073 to global and local 12074 have completed 12075 before performing 12076 the atomicrmw that 12077 is being released. 12078 12079 2. buffer/global/flat_atomic 12080 fence release - singlethread *none* *none* 12081 - wavefront 12082 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 12083 vmcnt(0) & vscnt(0) 12084 12085 - If CU wavefront execution 12086 mode, omit vmcnt(0) and 12087 vscnt(0). 12088 - If OpenCL and 12089 address space is 12090 not generic, omit 12091 lgkmcnt(0). 12092 - If OpenCL and 12093 address space is 12094 local, omit 12095 vmcnt(0) and vscnt(0). 12096 - However, since LLVM 12097 currently has no 12098 address space on 12099 the fence need to 12100 conservatively 12101 always generate. If 12102 fence had an 12103 address space then 12104 set to address 12105 space of OpenCL 12106 fence flag, or to 12107 generic if both 12108 local and global 12109 flags are 12110 specified. 12111 - Could be split into 12112 separate s_waitcnt 12113 vmcnt(0), s_waitcnt 12114 vscnt(0) and s_waitcnt 12115 lgkmcnt(0) to allow 12116 them to be 12117 independently moved 12118 according to the 12119 following rules. 12120 - s_waitcnt vmcnt(0) 12121 must happen after 12122 any preceding 12123 global/generic 12124 load/load 12125 atomic/ 12126 atomicrmw-with-return-value. 12127 - s_waitcnt vscnt(0) 12128 must happen after 12129 any preceding 12130 global/generic 12131 store/store atomic/ 12132 atomicrmw-no-return-value. 12133 - s_waitcnt lgkmcnt(0) 12134 must happen after 12135 any preceding 12136 local/generic 12137 load/store/load 12138 atomic/store atomic/ 12139 atomicrmw. 
12140 - Must happen before 12141 any following store 12142 atomic/atomicrmw 12143 with an equal or 12144 wider sync scope 12145 and memory ordering 12146 stronger than 12147 unordered (this is 12148 termed the 12149 fence-paired-atomic). 12150 - Ensures that all 12151 memory operations 12152 have 12153 completed before 12154 performing the 12155 following 12156 fence-paired-atomic. 12157 12158 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 12159 - system vmcnt(0) & vscnt(0) 12160 12161 - If OpenCL and 12162 address space is 12163 not generic, omit 12164 lgkmcnt(0). 12165 - If OpenCL and 12166 address space is 12167 local, omit 12168 vmcnt(0) and vscnt(0). 12169 - However, since LLVM 12170 currently has no 12171 address space on 12172 the fence need to 12173 conservatively 12174 always generate. If 12175 fence had an 12176 address space then 12177 set to address 12178 space of OpenCL 12179 fence flag, or to 12180 generic if both 12181 local and global 12182 flags are 12183 specified. 12184 - Could be split into 12185 separate s_waitcnt 12186 vmcnt(0), s_waitcnt 12187 vscnt(0) and s_waitcnt 12188 lgkmcnt(0) to allow 12189 them to be 12190 independently moved 12191 according to the 12192 following rules. 12193 - s_waitcnt vmcnt(0) 12194 must happen after 12195 any preceding 12196 global/generic 12197 load/load atomic/ 12198 atomicrmw-with-return-value. 12199 - s_waitcnt vscnt(0) 12200 must happen after 12201 any preceding 12202 global/generic 12203 store/store atomic/ 12204 atomicrmw-no-return-value. 12205 - s_waitcnt lgkmcnt(0) 12206 must happen after 12207 any preceding 12208 local/generic 12209 load/store/load 12210 atomic/store 12211 atomic/atomicrmw. 12212 - Must happen before 12213 any following store 12214 atomic/atomicrmw 12215 with an equal or 12216 wider sync scope 12217 and memory ordering 12218 stronger than 12219 unordered (this is 12220 termed the 12221 fence-paired-atomic). 
12222 - Ensures that all 12223 memory operations 12224 have 12225 completed before 12226 performing the 12227 following 12228 fence-paired-atomic. 12229 12230 **Acquire-Release Atomic** 12231 ------------------------------------------------------------------------------------ 12232 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 12233 - wavefront - local 12234 - generic 12235 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) & 12236 vmcnt(0) & vscnt(0) 12237 12238 - If CU wavefront execution 12239 mode, omit vmcnt(0) and 12240 vscnt(0). 12241 - If OpenCL, omit 12242 lgkmcnt(0). 12243 - Must happen after 12244 any preceding 12245 local/generic 12246 load/store/load 12247 atomic/store 12248 atomic/atomicrmw. 12249 - Could be split into 12250 separate s_waitcnt 12251 vmcnt(0), s_waitcnt 12252 vscnt(0), and s_waitcnt 12253 lgkmcnt(0) to allow 12254 them to be 12255 independently moved 12256 according to the 12257 following rules. 12258 - s_waitcnt vmcnt(0) 12259 must happen after 12260 any preceding 12261 global/generic load/load 12262 atomic/ 12263 atomicrmw-with-return-value. 12264 - s_waitcnt vscnt(0) 12265 must happen after 12266 any preceding 12267 global/generic 12268 store/store 12269 atomic/ 12270 atomicrmw-no-return-value. 12271 - s_waitcnt lgkmcnt(0) 12272 must happen after 12273 any preceding 12274 local/generic 12275 load/store/load 12276 atomic/store 12277 atomic/atomicrmw. 12278 - Must happen before 12279 the following 12280 atomicrmw. 12281 - Ensures that all 12282 memory operations 12283 have 12284 completed before 12285 performing the 12286 atomicrmw that is 12287 being released. 12288 12289 2. buffer/global_atomic 12290 3. s_waitcnt vm/vscnt(0) 12291 12292 - If CU wavefront execution 12293 mode, omit. 12294 - Use vmcnt(0) if atomic with 12295 return and vscnt(0) if 12296 atomic with no-return. 12297 - Must happen before 12298 the following 12299 buffer_gl0_inv. 
12300 - Ensures any 12301 following global 12302 data read is no 12303 older than the 12304 atomicrmw value 12305 being acquired. 12306 12307 4. buffer_gl0_inv 12308 12309 - If CU wavefront execution 12310 mode, omit. 12311 - Ensures that 12312 following 12313 loads will not see 12314 stale data. 12315 12316 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 12317 12318 - If CU wavefront execution 12319 mode, omit. 12320 - If OpenCL, omit. 12321 - Could be split into 12322 separate s_waitcnt 12323 vmcnt(0) and s_waitcnt 12324 vscnt(0) to allow 12325 them to be 12326 independently moved 12327 according to the 12328 following rules. 12329 - s_waitcnt vmcnt(0) 12330 must happen after 12331 any preceding 12332 global/generic load/load 12333 atomic/ 12334 atomicrmw-with-return-value. 12335 - s_waitcnt vscnt(0) 12336 must happen after 12337 any preceding 12338 global/generic 12339 store/store atomic/ 12340 atomicrmw-no-return-value. 12341 - Must happen before 12342 the following 12343 store. 12344 - Ensures that all 12345 global memory 12346 operations have 12347 completed before 12348 performing the 12349 store that is being 12350 released. 12351 12352 2. ds_atomic 12353 3. s_waitcnt lgkmcnt(0) 12354 12355 - If OpenCL, omit. 12356 - Must happen before 12357 the following 12358 buffer_gl0_inv. 12359 - Ensures any 12360 following global 12361 data read is no 12362 older than the local load 12363 atomic value being 12364 acquired. 12365 12366 4. buffer_gl0_inv 12367 12368 - If CU wavefront execution 12369 mode, omit. 12370 - If OpenCL omit. 12371 - Ensures that 12372 following 12373 loads will not see 12374 stale data. 12375 12376 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) & 12377 vmcnt(0) & vscnt(0) 12378 12379 - If CU wavefront execution 12380 mode, omit vmcnt(0) and 12381 vscnt(0). 12382 - If OpenCL, omit lgkmcnt(0). 
12383 - Could be split into 12384 separate s_waitcnt 12385 vmcnt(0), s_waitcnt 12386 vscnt(0) and s_waitcnt 12387 lgkmcnt(0) to allow 12388 them to be 12389 independently moved 12390 according to the 12391 following rules. 12392 - s_waitcnt vmcnt(0) 12393 must happen after 12394 any preceding 12395 global/generic load/load 12396 atomic/ 12397 atomicrmw-with-return-value. 12398 - s_waitcnt vscnt(0) 12399 must happen after 12400 any preceding 12401 global/generic 12402 store/store 12403 atomic/ 12404 atomicrmw-no-return-value. 12405 - s_waitcnt lgkmcnt(0) 12406 must happen after 12407 any preceding 12408 local/generic 12409 load/store/load 12410 atomic/store 12411 atomic/atomicrmw. 12412 - Must happen before 12413 the following 12414 atomicrmw. 12415 - Ensures that all 12416 memory operations 12417 have 12418 completed before 12419 performing the 12420 atomicrmw that is 12421 being released. 12422 12423 2. flat_atomic 12424 3. s_waitcnt lgkmcnt(0) & 12425 vmcnt(0) & vscnt(0) 12426 12427 - If CU wavefront execution 12428 mode, omit vmcnt(0) and 12429 vscnt(0). 12430 - If OpenCL, omit lgkmcnt(0). 12431 - Must happen before 12432 the following 12433 buffer_gl0_inv. 12434 - Ensures any 12435 following global 12436 data read is no 12437 older than the load 12438 atomic value being 12439 acquired. 12440 12441 3. buffer_gl0_inv 12442 12443 - If CU wavefront execution 12444 mode, omit. 12445 - Ensures that 12446 following 12447 loads will not see 12448 stale data. 12449 12450 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 12451 - system vmcnt(0) & vscnt(0) 12452 12453 - If OpenCL, omit 12454 lgkmcnt(0). 12455 - Could be split into 12456 separate s_waitcnt 12457 vmcnt(0), s_waitcnt 12458 vscnt(0) and s_waitcnt 12459 lgkmcnt(0) to allow 12460 them to be 12461 independently moved 12462 according to the 12463 following rules. 
12464 - s_waitcnt vmcnt(0) 12465 must happen after 12466 any preceding 12467 global/generic 12468 load/load atomic/ 12469 atomicrmw-with-return-value. 12470 - s_waitcnt vscnt(0) 12471 must happen after 12472 any preceding 12473 global/generic 12474 store/store atomic/ 12475 atomicrmw-no-return-value. 12476 - s_waitcnt lgkmcnt(0) 12477 must happen after 12478 any preceding 12479 local/generic 12480 load/store/load 12481 atomic/store 12482 atomic/atomicrmw. 12483 - Must happen before 12484 the following 12485 atomicrmw. 12486 - Ensures that all 12487 memory operations 12488 to global have 12489 completed before 12490 performing the 12491 atomicrmw that is 12492 being released. 12493 12494 2. buffer/global_atomic 12495 3. s_waitcnt vm/vscnt(0) 12496 12497 - Use vmcnt(0) if atomic with 12498 return and vscnt(0) if 12499 atomic with no-return. 12500 - Must happen before 12501 following 12502 buffer_gl*_inv. 12503 - Ensures the 12504 atomicrmw has 12505 completed before 12506 invalidating the 12507 caches. 12508 12509 4. buffer_gl0_inv; 12510 buffer_gl1_inv 12511 12512 - Must happen before 12513 any following 12514 global/generic 12515 load/load 12516 atomic/atomicrmw. 12517 - Ensures that 12518 following loads 12519 will not see stale 12520 global data. 12521 12522 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 12523 - system vmcnt(0) & vscnt(0) 12524 12525 - If OpenCL, omit 12526 lgkmcnt(0). 12527 - Could be split into 12528 separate s_waitcnt 12529 vmcnt(0), s_waitcnt 12530 vscnt(0), and s_waitcnt 12531 lgkmcnt(0) to allow 12532 them to be 12533 independently moved 12534 according to the 12535 following rules. 12536 - s_waitcnt vmcnt(0) 12537 must happen after 12538 any preceding 12539 global/generic 12540 load/load atomic 12541 atomicrmw-with-return-value. 12542 - s_waitcnt vscnt(0) 12543 must happen after 12544 any preceding 12545 global/generic 12546 store/store atomic/ 12547 atomicrmw-no-return-value. 
12548 - s_waitcnt lgkmcnt(0) 12549 must happen after 12550 any preceding 12551 local/generic 12552 load/store/load 12553 atomic/store 12554 atomic/atomicrmw. 12555 - Must happen before 12556 the following 12557 atomicrmw. 12558 - Ensures that all 12559 memory operations 12560 have 12561 completed before 12562 performing the 12563 atomicrmw that is 12564 being released. 12565 12566 2. flat_atomic 12567 3. s_waitcnt vm/vscnt(0) & 12568 lgkmcnt(0) 12569 12570 - If OpenCL, omit 12571 lgkmcnt(0). 12572 - Use vmcnt(0) if atomic with 12573 return and vscnt(0) if 12574 atomic with no-return. 12575 - Must happen before 12576 following 12577 buffer_gl*_inv. 12578 - Ensures the 12579 atomicrmw has 12580 completed before 12581 invalidating the 12582 caches. 12583 12584 4. buffer_gl0_inv; 12585 buffer_gl1_inv 12586 12587 - Must happen before 12588 any following 12589 global/generic 12590 load/load 12591 atomic/atomicrmw. 12592 - Ensures that 12593 following loads 12594 will not see stale 12595 global data. 12596 12597 fence acq_rel - singlethread *none* *none* 12598 - wavefront 12599 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 12600 vmcnt(0) & vscnt(0) 12601 12602 - If CU wavefront execution 12603 mode, omit vmcnt(0) and 12604 vscnt(0). 12605 - If OpenCL and 12606 address space is 12607 not generic, omit 12608 lgkmcnt(0). 12609 - If OpenCL and 12610 address space is 12611 local, omit 12612 vmcnt(0) and vscnt(0). 12613 - However, 12614 since LLVM 12615 currently has no 12616 address space on 12617 the fence need to 12618 conservatively 12619 always generate 12620 (see comment for 12621 previous fence). 12622 - Could be split into 12623 separate s_waitcnt 12624 vmcnt(0), s_waitcnt 12625 vscnt(0) and s_waitcnt 12626 lgkmcnt(0) to allow 12627 them to be 12628 independently moved 12629 according to the 12630 following rules. 
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store atomic/
                                                             atomicrmw.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing any
                                                             following global
                                                             memory operations.
                                                           - Ensures that the
                                                             preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed
                                                             before following
                                                             global memory
                                                             operations. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have
                                                             completed before a
                                                             following
                                                             local/generic store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of
                                                             release.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures that the
                                                             acquire-fence-paired
                                                             atomic has completed
                                                             before invalidating
                                                             the
                                                             cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             acquire-fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
12719 - Ensures that 12720 following 12721 loads will not see 12722 stale data. 12723 12724 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 12725 - system vmcnt(0) & vscnt(0) 12726 12727 - If OpenCL and 12728 address space is 12729 not generic, omit 12730 lgkmcnt(0). 12731 - If OpenCL and 12732 address space is 12733 local, omit 12734 vmcnt(0) and vscnt(0). 12735 - However, since LLVM 12736 currently has no 12737 address space on 12738 the fence need to 12739 conservatively 12740 always generate 12741 (see comment for 12742 previous fence). 12743 - Could be split into 12744 separate s_waitcnt 12745 vmcnt(0), s_waitcnt 12746 vscnt(0) and s_waitcnt 12747 lgkmcnt(0) to allow 12748 them to be 12749 independently moved 12750 according to the 12751 following rules. 12752 - s_waitcnt vmcnt(0) 12753 must happen after 12754 any preceding 12755 global/generic 12756 load/load 12757 atomic/ 12758 atomicrmw-with-return-value. 12759 - s_waitcnt vscnt(0) 12760 must happen after 12761 any preceding 12762 global/generic 12763 store/store atomic/ 12764 atomicrmw-no-return-value. 12765 - s_waitcnt lgkmcnt(0) 12766 must happen after 12767 any preceding 12768 local/generic 12769 load/store/load 12770 atomic/store 12771 atomic/atomicrmw. 12772 - Must happen before 12773 the following 12774 buffer_gl*_inv. 12775 - Ensures that the 12776 preceding 12777 global/local/generic 12778 load 12779 atomic/atomicrmw 12780 with an equal or 12781 wider sync scope 12782 and memory ordering 12783 stronger than 12784 unordered (this is 12785 termed the 12786 acquire-fence-paired-atomic) 12787 has completed 12788 before invalidating 12789 the caches. This 12790 satisfies the 12791 requirements of 12792 acquire. 
12793 - Ensures that all 12794 previous memory 12795 operations have 12796 completed before a 12797 following 12798 global/local/generic 12799 store 12800 atomic/atomicrmw 12801 with an equal or 12802 wider sync scope 12803 and memory ordering 12804 stronger than 12805 unordered (this is 12806 termed the 12807 release-fence-paired-atomic). 12808 This satisfies the 12809 requirements of 12810 release. 12811 12812 2. buffer_gl0_inv; 12813 buffer_gl1_inv 12814 12815 - Must happen before 12816 any following 12817 global/generic 12818 load/load 12819 atomic/store/store 12820 atomic/atomicrmw. 12821 - Ensures that 12822 following loads 12823 will not see stale 12824 global data. This 12825 satisfies the 12826 requirements of 12827 acquire. 12828 12829 **Sequential Consistent Atomic** 12830 ------------------------------------------------------------------------------------ 12831 load atomic seq_cst - singlethread - global *Same as corresponding 12832 - wavefront - local load atomic acquire, 12833 - generic except must generate 12834 all instructions even 12835 for OpenCL.* 12836 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) & 12837 - generic vmcnt(0) & vscnt(0) 12838 12839 - If CU wavefront execution 12840 mode, omit vmcnt(0) and 12841 vscnt(0). 12842 - Could be split into 12843 separate s_waitcnt 12844 vmcnt(0), s_waitcnt 12845 vscnt(0), and s_waitcnt 12846 lgkmcnt(0) to allow 12847 them to be 12848 independently moved 12849 according to the 12850 following rules. 12851 - s_waitcnt lgkmcnt(0) must 12852 happen after 12853 preceding 12854 local/generic load 12855 atomic/store 12856 atomic/atomicrmw 12857 with memory 12858 ordering of seq_cst 12859 and with equal or 12860 wider sync scope. 12861 (Note that seq_cst 12862 fences have their 12863 own s_waitcnt 12864 lgkmcnt(0) and so do 12865 not need to be 12866 considered.) 
12867 - s_waitcnt vmcnt(0) 12868 must happen after 12869 preceding 12870 global/generic load 12871 atomic/ 12872 atomicrmw-with-return-value 12873 with memory 12874 ordering of seq_cst 12875 and with equal or 12876 wider sync scope. 12877 (Note that seq_cst 12878 fences have their 12879 own s_waitcnt 12880 vmcnt(0) and so do 12881 not need to be 12882 considered.) 12883 - s_waitcnt vscnt(0) 12884 Must happen after 12885 preceding 12886 global/generic store 12887 atomic/ 12888 atomicrmw-no-return-value 12889 with memory 12890 ordering of seq_cst 12891 and with equal or 12892 wider sync scope. 12893 (Note that seq_cst 12894 fences have their 12895 own s_waitcnt 12896 vscnt(0) and so do 12897 not need to be 12898 considered.) 12899 - Ensures any 12900 preceding 12901 sequential 12902 consistent global/local 12903 memory instructions 12904 have completed 12905 before executing 12906 this sequentially 12907 consistent 12908 instruction. This 12909 prevents reordering 12910 a seq_cst store 12911 followed by a 12912 seq_cst load. (Note 12913 that seq_cst is 12914 stronger than 12915 acquire/release as 12916 the reordering of 12917 load acquire 12918 followed by a store 12919 release is 12920 prevented by the 12921 s_waitcnt of 12922 the release, but 12923 there is nothing 12924 preventing a store 12925 release followed by 12926 load acquire from 12927 completing out of 12928 order. The s_waitcnt 12929 could be placed after 12930 seq_store or before 12931 the seq_load. We 12932 choose the load to 12933 make the s_waitcnt be 12934 as late as possible 12935 so that the store 12936 may have already 12937 completed.) 12938 12939 2. *Following 12940 instructions same as 12941 corresponding load 12942 atomic acquire, 12943 except must generate 12944 all instructions even 12945 for OpenCL.* 12946 load atomic seq_cst - workgroup - local 12947 12948 1. s_waitcnt vmcnt(0) & vscnt(0) 12949 12950 - If CU wavefront execution 12951 mode, omit. 
12952 - Could be split into 12953 separate s_waitcnt 12954 vmcnt(0) and s_waitcnt 12955 vscnt(0) to allow 12956 them to be 12957 independently moved 12958 according to the 12959 following rules. 12960 - s_waitcnt vmcnt(0) 12961 Must happen after 12962 preceding 12963 global/generic load 12964 atomic/ 12965 atomicrmw-with-return-value 12966 with memory 12967 ordering of seq_cst 12968 and with equal or 12969 wider sync scope. 12970 (Note that seq_cst 12971 fences have their 12972 own s_waitcnt 12973 vmcnt(0) and so do 12974 not need to be 12975 considered.) 12976 - s_waitcnt vscnt(0) 12977 Must happen after 12978 preceding 12979 global/generic store 12980 atomic/ 12981 atomicrmw-no-return-value 12982 with memory 12983 ordering of seq_cst 12984 and with equal or 12985 wider sync scope. 12986 (Note that seq_cst 12987 fences have their 12988 own s_waitcnt 12989 vscnt(0) and so do 12990 not need to be 12991 considered.) 12992 - Ensures any 12993 preceding 12994 sequential 12995 consistent global 12996 memory instructions 12997 have completed 12998 before executing 12999 this sequentially 13000 consistent 13001 instruction. This 13002 prevents reordering 13003 a seq_cst store 13004 followed by a 13005 seq_cst load. (Note 13006 that seq_cst is 13007 stronger than 13008 acquire/release as 13009 the reordering of 13010 load acquire 13011 followed by a store 13012 release is 13013 prevented by the 13014 s_waitcnt of 13015 the release, but 13016 there is nothing 13017 preventing a store 13018 release followed by 13019 load acquire from 13020 completing out of 13021 order. The s_waitcnt 13022 could be placed after 13023 seq_store or before 13024 the seq_load. We 13025 choose the load to 13026 make the s_waitcnt be 13027 as late as possible 13028 so that the store 13029 may have already 13030 completed.) 13031 13032 2. 
*Following 13033 instructions same as 13034 corresponding load 13035 atomic acquire, 13036 except must generate 13037 all instructions even 13038 for OpenCL.* 13039 13040 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 13041 - system - generic vmcnt(0) & vscnt(0) 13042 13043 - Could be split into 13044 separate s_waitcnt 13045 vmcnt(0), s_waitcnt 13046 vscnt(0) and s_waitcnt 13047 lgkmcnt(0) to allow 13048 them to be 13049 independently moved 13050 according to the 13051 following rules. 13052 - s_waitcnt lgkmcnt(0) 13053 must happen after 13054 preceding 13055 local load 13056 atomic/store 13057 atomic/atomicrmw 13058 with memory 13059 ordering of seq_cst 13060 and with equal or 13061 wider sync scope. 13062 (Note that seq_cst 13063 fences have their 13064 own s_waitcnt 13065 lgkmcnt(0) and so do 13066 not need to be 13067 considered.) 13068 - s_waitcnt vmcnt(0) 13069 must happen after 13070 preceding 13071 global/generic load 13072 atomic/ 13073 atomicrmw-with-return-value 13074 with memory 13075 ordering of seq_cst 13076 and with equal or 13077 wider sync scope. 13078 (Note that seq_cst 13079 fences have their 13080 own s_waitcnt 13081 vmcnt(0) and so do 13082 not need to be 13083 considered.) 13084 - s_waitcnt vscnt(0) 13085 Must happen after 13086 preceding 13087 global/generic store 13088 atomic/ 13089 atomicrmw-no-return-value 13090 with memory 13091 ordering of seq_cst 13092 and with equal or 13093 wider sync scope. 13094 (Note that seq_cst 13095 fences have their 13096 own s_waitcnt 13097 vscnt(0) and so do 13098 not need to be 13099 considered.) 13100 - Ensures any 13101 preceding 13102 sequential 13103 consistent global 13104 memory instructions 13105 have completed 13106 before executing 13107 this sequentially 13108 consistent 13109 instruction. This 13110 prevents reordering 13111 a seq_cst store 13112 followed by a 13113 seq_cst load. 
(Note 13114 that seq_cst is 13115 stronger than 13116 acquire/release as 13117 the reordering of 13118 load acquire 13119 followed by a store 13120 release is 13121 prevented by the 13122 s_waitcnt of 13123 the release, but 13124 there is nothing 13125 preventing a store 13126 release followed by 13127 load acquire from 13128 completing out of 13129 order. The s_waitcnt 13130 could be placed after 13131 seq_store or before 13132 the seq_load. We 13133 choose the load to 13134 make the s_waitcnt be 13135 as late as possible 13136 so that the store 13137 may have already 13138 completed.) 13139 13140 2. *Following 13141 instructions same as 13142 corresponding load 13143 atomic acquire, 13144 except must generate 13145 all instructions even 13146 for OpenCL.* 13147 store atomic seq_cst - singlethread - global *Same as corresponding 13148 - wavefront - local store atomic release, 13149 - workgroup - generic except must generate 13150 - agent all instructions even 13151 - system for OpenCL.* 13152 atomicrmw seq_cst - singlethread - global *Same as corresponding 13153 - wavefront - local atomicrmw acq_rel, 13154 - workgroup - generic except must generate 13155 - agent all instructions even 13156 - system for OpenCL.* 13157 fence seq_cst - singlethread *none* *Same as corresponding 13158 - wavefront fence acq_rel, 13159 - workgroup except must generate 13160 - agent all instructions even 13161 - system for OpenCL.* 13162 ============ ============ ============== ========== ================================ 13163 13164.. _amdgpu-amdhsa-trap-handler-abi: 13165 13166Trap Handler ABI 13167~~~~~~~~~~~~~~~~ 13168 13169For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible 13170runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that 13171supports the ``s_trap`` instruction. 
For usage see: 13172 13173- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table` 13174- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table` 13175- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table` 13176 13177 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2 13178 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table 13179 13180 =================== =============== =============== ======================================= 13181 Usage Code Sequence Trap Handler Description 13182 Inputs 13183 =================== =============== =============== ======================================= 13184 reserved ``s_trap 0x00`` Reserved by hardware. 13185 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap`` 13186 ``queue_ptr`` intrinsic (not implemented). 13187 ``VGPR0``: 13188 ``arg`` 13189 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at 13190 ``queue_ptr`` the trap instruction. The associated 13191 queue is signalled to put it into the 13192 error state. When the queue is put in 13193 the error state, the waves executing 13194 dispatches on the queue will be 13195 terminated. 13196 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves 13197 as a no-operation. The trap handler 13198 is entered and immediately returns to 13199 continue execution of the wavefront. 13200 - If the debugger is enabled, causes 13201 the debug trap to be reported by the 13202 debugger and the wavefront is put in 13203 the halt state with the PC at the 13204 instruction. The debugger must 13205 increment the PC and resume the wave. 13206 reserved ``s_trap 0x04`` Reserved. 13207 reserved ``s_trap 0x05`` Reserved. 13208 reserved ``s_trap 0x06`` Reserved. 13209 reserved ``s_trap 0x07`` Reserved. 13210 reserved ``s_trap 0x08`` Reserved. 13211 reserved ``s_trap 0xfe`` Reserved. 13212 reserved ``s_trap 0xff`` Reserved. 
13213 =================== =============== =============== ======================================= 13214 13215.. 13216 13217 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3 13218 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table 13219 13220 =================== =============== =============== ======================================= 13221 Usage Code Sequence Trap Handler Description 13222 Inputs 13223 =================== =============== =============== ======================================= 13224 reserved ``s_trap 0x00`` Reserved by hardware. 13225 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for 13226 breakpoints. Causes wave to be halted 13227 with the PC at the trap instruction. 13228 The debugger is responsible to resume 13229 the wave, including the instruction 13230 that the breakpoint overwrote. 13231 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at 13232 ``queue_ptr`` the trap instruction. The associated 13233 queue is signalled to put it into the 13234 error state. When the queue is put in 13235 the error state, the waves executing 13236 dispatches on the queue will be 13237 terminated. 13238 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves 13239 as a no-operation. The trap handler 13240 is entered and immediately returns to 13241 continue execution of the wavefront. 13242 - If the debugger is enabled, causes 13243 the debug trap to be reported by the 13244 debugger and the wavefront is put in 13245 the halt state with the PC at the 13246 instruction. The debugger must 13247 increment the PC and resume the wave. 13248 reserved ``s_trap 0x04`` Reserved. 13249 reserved ``s_trap 0x05`` Reserved. 13250 reserved ``s_trap 0x06`` Reserved. 13251 reserved ``s_trap 0x07`` Reserved. 13252 reserved ``s_trap 0x08`` Reserved. 13253 reserved ``s_trap 0xfe`` Reserved. 13254 reserved ``s_trap 0xff`` Reserved. 
13255 =================== =============== =============== ======================================= 13256 13257.. 13258 13259 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above 13260 :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table 13261 13262 =================== =============== ================ ================= ======================================= 13263 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description 13264 =================== =============== ================ ================= ======================================= 13265 reserved ``s_trap 0x00`` Reserved by hardware. 13266 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for 13267 breakpoints. Causes wave to be halted 13268 with the PC at the trap instruction. 13269 The debugger is responsible to resume 13270 the wave, including the instruction 13271 that the breakpoint overwrote. 13272 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at 13273 ``queue_ptr`` the trap instruction. The associated 13274 queue is signalled to put it into the 13275 error state. When the queue is put in 13276 the error state, the waves executing 13277 dispatches on the queue will be 13278 terminated. 13279 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If debugger not enabled then behaves 13280 as a no-operation. The trap handler 13281 is entered and immediately returns to 13282 continue execution of the wavefront. 13283 - If the debugger is enabled, causes 13284 the debug trap to be reported by the 13285 debugger and the wavefront is put in 13286 the halt state with the PC at the 13287 instruction. The debugger must 13288 increment the PC and resume the wave. 13289 reserved ``s_trap 0x04`` Reserved. 13290 reserved ``s_trap 0x05`` Reserved. 13291 reserved ``s_trap 0x06`` Reserved. 13292 reserved ``s_trap 0x07`` Reserved. 13293 reserved ``s_trap 0x08`` Reserved. 13294 reserved ``s_trap 0xfe`` Reserved. 
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is WIP that will
  be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1.  Clang decides the kernarg layout to match the *HSA Programmer's Language
    Reference* [HSA]_.

    - All structs are passed directly.
    - Lambda values are passed *TBA*.

    .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        by-value struct?
      - What is ABI for lambda values?

2.  The kernel performs certain setup in its prolog, as described in
    :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:

1.  SGPR0-3 contain a V# with the following properties (see
    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

    * Base address pointing to the beginning of the wavefront scratch backing
      memory.
    * Swizzled with dword element size and stride of wavefront size elements.

2.  The FLAT_SCRATCH register pair is set up. See
    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3.  GFX6-GFX8: M0 register is set to the size of LDS in bytes. See
    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4.  The EXEC register is set to the lanes active on entry to the function.
5.  MODE register: *TBD*
6.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
    below.
7.  SGPR30-31 contain the return address (RA): the code address that the
    function must return to when it completes. The value is undefined if the
    function is *no return*.
8.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
    offset relative to the beginning of the wavefront scratch backing memory.

    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
    manner.

    The unswizzled SP value can be converted into the swizzled SP value by:

      | swizzled SP = unswizzled SP / wavefront size

    This may be used to obtain the private address space address of stack
    objects and to convert this address to a flat address by adding the flat
    scratch aperture base address.

    The swizzled SP value is always 4 byte aligned for the ``r600``
    architecture and 16 byte aligned for the ``amdgcn`` architecture.

    .. note::

      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
      OpenCL language which has the largest base type defined as 16 bytes.

    On entry, the swizzled SP value is the address of the first function
    argument passed on the stack. Other stack passed arguments are positive
    offsets from the entry swizzled SP value.

    The function may use positive offsets beyond the last stack passed argument
    for stack allocated local variables and register spill slots. If necessary,
    the function may align these to greater alignment than 16 bytes. After these
    the function may dynamically allocate space for such things as runtime sized
    ``alloca`` local allocations.

    If the function calls another function, it will place any stack allocated
    arguments after the last local allocation and adjust SGPR32 to the address
    after the last local allocation.

9.  All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.

On exit from a function:

1.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
    described below. Any registers used are considered clobbered registers.
2.  The following registers are preserved and have the same value as on entry:

    * FLAT_SCRATCH
    * EXEC
    * GFX6-GFX8: M0
    * All SGPR registers except the clobbered registers of SGPR4-31.
    * VGPR40-47
    * VGPR56-63
    * VGPR72-79
    * VGPR88-95
    * VGPR104-111
    * VGPR120-127
    * VGPR136-143
    * VGPR152-159
    * VGPR168-175
    * VGPR184-191
    * VGPR200-207
    * VGPR216-223
    * VGPR232-239
    * VGPR248-255

      .. note::

        Except the argument registers, the VGPRs clobbered and the preserved
        registers are intermixed at regular intervals in order to keep a
        similar ratio independent of the number of allocated VGPRs.

    * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
    * Lanes of all VGPRs that are inactive at the call site.

    For the AMDGPU backend, an inter-procedural register allocation (IPRA)
    optimization may mark some of the clobbered SGPR and VGPR registers as
    preserved if it can be determined that the called function does not change
    their value.

3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.

.. TODO::

  - How are function results returned? The address of structured types is passed
    by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

The source language input struct type arguments that are greater than 16 bytes
are passed by reference.
The caller is responsible for allocating a stack
location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

  Is this definition correct? Or is ``sret`` only used if passing in registers, and
  pass as non-decomposed struct as stack argument? Or something else? Is the
  memory location in the caller stack frame, or a stack memory argument and so
  no address is passed as the caller can directly write to the argument stack
  location? But then the stack location is still live after return. If an
  argument stack location, is it the first stack argument or the last one?

Lambda argument types are treated as struct types with an implementation defined
set of fields.

.. TODO::

  Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function.
The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

  Are recursion or external functions supported?

1.  Work-Item ID (1 VGPR)

    The X, Y and Z work-item IDs are packed into a single VGPR with the
    following layout. Only fields actually used by the function are set. The
    other bits are undefined.

    The values come from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

    .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      =======  =======  ==============
      Bits     Size     Field Name
      =======  =======  ==============
      9:0      10 bits  X Work-Item ID
      19:10    10 bits  Y Work-Item ID
      29:20    10 bits  Z Work-Item ID
      31:30    2 bits   Unused
      =======  =======  ==============

2.  Dispatch Ptr (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3.  Queue Ptr (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4.  Kernarg Segment Ptr (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5.  Dispatch id (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6.  Work-Group ID X (1 SGPR)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7.  Work-Group ID Y (1 SGPR)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13578 135798. Work-Group ID Z (1 SGPR) 13580 13581 The value comes from the initial kernel execution state. See 13582 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 13583 135849. Implicit Argument Ptr (2 SGPRs) 13585 13586 The value is computed by adding an offset to Kernarg Segment Ptr to get the 13587 global address space pointer to the first kernarg implicit argument. 13588 13589The input and result arguments are assigned in order in the following manner: 13590 13591.. note:: 13592 13593 There are likely some errors and omissions in the following description that 13594 need correction. 13595 13596 .. TODO:: 13597 13598 Check the Clang source code to decipher how function arguments and return 13599 results are handled. Also see the AMDGPU specific values used. 13600 13601* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to 13602 VGPR31. 13603 13604 If there are more arguments than will fit in these registers, the remaining 13605 arguments are allocated on the stack in order on naturally aligned 13606 addresses. 13607 13608 .. TODO:: 13609 13610 How are overly aligned structures allocated on the stack? 13611 13612* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to 13613 SGPR29. 13614 13615 If there are more arguments than will fit in these registers, the remaining 13616 arguments are allocated on the stack in order on naturally aligned 13617 addresses. 13618 13619Note that decomposed struct type arguments may have some fields passed in 13620registers and some in memory. 13621 13622.. TODO:: 13623 13624 So, a struct which can pass some fields as decomposed register arguments, will 13625 pass the rest as decomposed stack elements? But an argument that will not start 13626 in registers will not be decomposed and will be passed as a non-decomposed 13627 stack value? 
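
The bit layout of the packed Work-Item ID implicit argument listed above can be
illustrated with a small host-side sketch. This is only an illustration of the
10-bit X, Y and Z fields, not code taken from the backend. The function names
``pack_work_item_id`` and ``unpack_work_item_id`` are hypothetical, and the
sketch assumes all three fields hold defined values, which the ABI does not
guarantee for fields the callee does not use.

.. code-block:: python

   # Illustrative only: mirrors the 10-bit X/Y/Z fields of the packed VGPR.
   FIELD_BITS = 10
   FIELD_MASK = (1 << FIELD_BITS) - 1  # 0x3FF


   def pack_work_item_id(x, y, z):
       """Pack X, Y and Z work-item IDs into bits 9:0, 19:10 and 29:20."""
       for value in (x, y, z):
           if not 0 <= value <= FIELD_MASK:
               raise ValueError("work-item ID does not fit in 10 bits")
       return x | (y << FIELD_BITS) | (z << (2 * FIELD_BITS))


   def unpack_work_item_id(packed):
       """Return (x, y, z); bits 31:30 are unused and are ignored."""
       x = packed & FIELD_MASK
       y = (packed >> FIELD_BITS) & FIELD_MASK
       z = (packed >> (2 * FIELD_BITS)) & FIELD_MASK
       return x, y, z

Masking each field after the shift matters because bits 31:30 are unused and
may hold arbitrary values.
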

The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
    unswizzled scratch address. It is only needed if runtime sized ``alloca``
    are used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires the runtime stack alignment.

3.  Allocating SGPR arguments on the stack is not supported.

4.  No CFI is currently generated. See
    :ref:`amdgpu-dwarf-call-frame-information`.

    .. note::

       CFI will be generated that defines the CFA as the unswizzled address
       relative to the wave scratch base in the unswizzled private address space
       of the lowest address stack allocated local variable.

       ``DW_AT_frame_base`` will be defined as the swizzled address in the
       swizzled private address space by dividing the CFA by the wavefront size
       (since CFA is always at least dword aligned which matches the scratch
       swizzle element size).

       If no dynamic stack alignment was performed, the stack allocated arguments
       are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
       local variables and register spill slots are accessed as positive offsets
       relative to ``DW_AT_frame_base``.

5.  Function argument passing is implemented by copying the input physical
    registers to virtual registers on entry. The register allocator can spill if
    necessary. These are copied back to physical registers at call sites. The
    net effect is that each function call can have these values in entirely
    distinct locations. The IPRA can help avoid shuffling argument registers.
6.  Call sites are implemented by setting up the arguments at positive offsets
    from SP. Then SP is incremented to account for the known frame size before
    the call and decremented after the call.

    .. note::

       The CFI will reflect the changed calculation needed to compute the CFA
       from SP.

7.  4 byte spill slots are used in the stack frame. One slot is allocated for an
    emergency spill slot. Buffer instructions are used for stack accesses and
    not the ``flat_scratch`` instruction.

    .. TODO::

       Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to required alignment.
   - Stack is aligned to max(16, max formal argument alignment)
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as address of locals or arguments. Difference is
     apparent when have implemented dynamic alignment.
   - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
     highest address of stack frame and use negative offset for locals. Would
     allow SP to be the same as FP and could support signal-handler-like as now
     have a real SP for the top of the stack.
   - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
     arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdpal-code-object-metadata-section:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

.. note::

  The metadata is currently in development and is subject to major
  changes. Only the current version is supported. *When this document
  was generated the version was 2.6.*

Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the keys
defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by "*vendor-name*." where ``vendor-name``
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.

  .. table:: AMDPAL Code Object Metadata Map
     :name: amdgpu-amdpal-code-object-metadata-map-table

     =================== ============== ========= ======================================================================
     String Key          Value Type     Required? Description
     =================== ============== ========= ======================================================================
     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                  definition of the keys included in that map.
     =================== ============== ========= ======================================================================

..

  .. table:: AMDPAL Code Object Pipeline Metadata Map
     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table

     ====================================== ============== ========= ===================================================
     String Key                             Value Type     Required? Description
     ====================================== ============== ========= ===================================================
     ".name"                                string                   Source name of the pipeline.
     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:

                                                                     - "VsPs"
                                                                     - "Gs"
                                                                     - "Cs"
                                                                     - "Ngg"
                                                                     - "Tess"
                                                                     - "GsTess"
                                                                     - "NggTess"

     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
                                            2 integers               64 bits is the "stable" portion of the hash, used
                                                                     for e.g. shader replacement lookup. Upper 64 bits
                                                                     is the "unique" portion of the hash, used for
                                                                     e.g. pipeline cache lookup. The value is
                                                                     implementation defined, and can not be relied on
                                                                     between different builds of the compiler.
     ".shaders"                             map                      Per-API shader metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".hardware_stages"                     map                      Per-hardware stage metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".shader_functions"                    map                      Per-shader function metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".registers"                           map            Required  Hardware register configuration. See
                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".user_data_limit"                     integer                  Number of user data entries accessed by this
                                                                     pipeline.
     ".spill_threshold"                     integer                  The user data spill threshold. 0xFFFF for
                                                                     NoUserDataSpilling.
     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
                                                                     viewport array index feature. Pipelines which use
                                                                     this feature can render into all 16 viewports,
                                                                     whereas pipelines which do not use it are
                                                                     restricted to viewport #0.
     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
                                                                     handling data-passing between the ES and GS
                                                                     shader stages. This can be zero if the data is
                                                                     passed using off-chip buffers. This value should
                                                                     be used to program all user-SGPRs which have been
                                                                     marked with "UserDataMapping::EsGsLdsSize"
                                                                     (typically only the GS and VS HW stages will ever
                                                                     have a user-SGPR so marked).
     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
                                                                     (maximum number of threads in a subgroup).
     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
     ".api"                                 string                   Name of the client graphics API.
     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
                                                                     be defined by the driver using the compiler if
                                                                     they want to be able to correlate API-specific
                                                                     information used during creation at a later time.
     ====================================== ============== ========= ===================================================

..

  .. table:: AMDPAL Code Object Shader Map
     :name: amdgpu-amdpal-code-object-shader-map-table


     +-------------+--------------+-------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                        |
     +=============+==============+===================================================================+
     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
     |- ".vertex"  |              |for the definition of the keys included in that map.               |
     |- ".hull"    |              |                                                                   |
     |- ".domain"  |              |                                                                   |
     |- ".geometry"|              |                                                                   |
     |- ".pixel"   |              |                                                                   |
     +-------------+--------------+-------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object API Shader Metadata Map
     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table

     ==================== ============== ========= =====================================================================
     String Key           Value Type     Required? Description
     ==================== ============== ========= =====================================================================
     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
                          2 integers               is implementation defined, and can not be relied on between
                                                   different builds of the compiler.
     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
                          string                   include:

                                                   - ".ls"
                                                   - ".hs"
                                                   - ".es"
                                                   - ".gs"
                                                   - ".vs"
                                                   - ".ps"
                                                   - ".cs"

     ==================== ============== ========= =====================================================================

..

  .. table:: AMDPAL Code Object Hardware Stage Map
     :name: amdgpu-amdpal-code-object-hardware-stage-map-table

     +-------------+--------------+-----------------------------------------------------------------------+
     |String Key   |Value Type    |Description                                                            |
     +=============+==============+=======================================================================+
     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
     |- ".hs"      |              |for the definition of the keys included in that map.                   |
     |- ".es"      |              |                                                                       |
     |- ".gs"      |              |                                                                       |
     |- ".vs"      |              |                                                                       |
     |- ".ps"      |              |                                                                       |
     |- ".cs"      |              |                                                                       |
     +-------------+--------------+-----------------------------------------------------------------------+

..

  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table

     ========================== ============== ========= ===============================================================
     String Key                 Value Type     Required? Description
     ========================== ============== ========= ===============================================================
     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
     ".lds_size"                integer                  Local Data Share size in bytes.
     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
     ".vgpr_count"              integer                  Number of VGPRs used.
     ".agpr_count"              integer                  Number of AGPRs used.
     ".sgpr_count"              integer                  Number of SGPRs used.
     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
                                                         directive to instruct the compiler to limit the VGPR usage to
                                                         be less than or equal to the specified value (only set if
                                                         different from HW default).
     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
                                                         default).
     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
                                3 integers
     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
     ".writes_depth"            boolean                  The shader writes out a depth value.
     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
                                                         memory or GDS.
     ".uses_prim_id"            boolean                  The shader uses PrimID.
     ========================== ============== ========= ===============================================================

..

  .. table:: AMDPAL Code Object Shader Function Map
     :name: amdgpu-amdpal-code-object-shader-function-map-table

     =============== ============== ====================================================================
     String Key      Value Type     Description
     =============== ============== ====================================================================
     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
                                    entry address. The value is the function's metadata. See
                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
     =============== ============== ====================================================================

..

  .. table:: AMDPAL Code Object Shader Function Metadata Map
     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table

     ============================= ============== =================================================================
     String Key                    Value Type     Description
     ============================= ============== =================================================================
     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
                                   2 integers     is implementation defined, and can not be relied on between
                                                  different builds of the compiler.
     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
     ".lds_size"                   integer        Size in bytes of LDS memory.
     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
     ".shader_subtype"             string         Shader subtype/kind. Values include:

                                                  - "Unknown"

     ============================= ============== =================================================================

..

  .. table:: AMDPAL Code Object Register Map
     :name: amdgpu-amdpal-code-object-register-map-table

     ========================== ============== ====================================================================
     32-bit Integer Key         Value Type     Description
     ========================== ============== ====================================================================
     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
                                               a GRBM register (i.e., driver accessible GPU register number, not
                                               shader GPR register number). The driver is required to program each
                                               specified register to the corresponding specified value when
                                               executing this pipeline.
                                               Typically, the ``reg offsets`` are the
                                               ``uint16_t`` offsets to each register as defined by the hardware
                                               chip headers. The register is set to the provided value. However, a
                                               ``reg offset`` that specifies a user data register (e.g.,
                                               COMPUTE_USER_DATA_0) needs special treatment. See
                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
                                               information.
     ========================== ============== ====================================================================

.. _amdgpu-amdpal-code-object-user-data-section:

User Data
+++++++++

Each hardware stage has a set of 32-bit physical SPI *user data registers*
(either 16 or 32 based on graphics IP and the stage) which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware
shader.

PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline a client can use to pass arguments from a command
buffer to one or more shaders in that pipeline. The ELF code object must
specify a mapping from virtualized *user data entries* to physical *user
data registers*, and PAL is responsible for implementing that mapping,
including spilling overflow *user data entries* to memory if needed.

Since the *user data registers* are GRBM-accessible SPI registers, this
mapping is actually embedded in the ``.registers`` metadata entry. For
most registers, the value in that map is a literal 32-bit value that
should be written to the register by the driver.
However, when the
register is a *user data register* (any USER_DATA register e.g.,
SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
the driver to write either a *user data entry* value or one of several
driver-internal values to the register. This encoding is described in
the following table:

.. note::

  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0
  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
  always be programmed to the address of the GlobalTable, and *user data
  register* 1 must always be programmed to the address of the PerShaderTable.

..

  .. table:: AMDPAL User Data Mapping
     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table

     ========== ================= ===============================================================================
     Value      Name              Description
     ========== ================= ===============================================================================
     0..127     *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
     0x10000000 GlobalTable       32-bit pointer to GPU memory containing the global internal table (should
                                  always point to *user data register* 0).
     0x10000001 PerShaderTable    32-bit pointer to GPU memory containing the per-shader internal table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
                                  for more detail (should always point to *user data register* 1).
     0x10000002 SpillTable        32-bit pointer to GPU memory containing the user data spill table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
                                  more detail.
     0x10000003 BaseVertex        Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
                                  reference the draw index in the vertex shader. Only supported by the first
                                  stage in a graphics pipeline.
     0x10000004 BaseInstance      Instance offset (32-bit unsigned integer). Only supported by the first stage in
                                  a graphics pipeline.
     0x10000005 DrawIndex         Draw index (32-bit unsigned integer). Only supported by the first stage in a
                                  graphics pipeline.
     0x10000006 Workgroup         Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
                                  a buffer containing the grid dimensions for a Compute dispatch operation. The
                                  high half of the address is stored in the next sequential user-SGPR. Only
                                  supported by compute pipelines.
     0x1000000A EsGsLdsSize       Indicates that PAL will program this user-SGPR to contain the amount of LDS
                                  space used for the ES/GS pseudo-ring-buffer for passing data between shader
                                  stages.
     0x1000000B ViewId            View id (32-bit unsigned integer) identifies a view of graphic
                                  pipeline instancing.
     0x1000000C StreamOutTable    32-bit pointer to GPU memory containing the stream out target SRD table. This
                                  can only appear for one shader stage per pipeline.
     0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
     0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
                                  only appear for one shader stage per pipeline.
     0x10000010 UavExportTable    32-bit pointer to GPU memory containing the UAV export SRD table. This can
                                  only appear for one shader stage per pipeline (PS). These replace color targets
                                  and are completely separate from any UAVs used by the shader. This is optional,
                                  and only used by the PS when UAV exports are used to replace color-target
                                  exports to optimize specific shaders.
     0x10000011 NggCullingData    64-bit pointer to GPU memory containing the hardware register data needed by
                                  some NGG pipelines to perform culling. This value contains the address of the
                                  first of two consecutive registers which provide the full GPU address.
     0x10000015 FetchShaderPtr    64-bit pointer to GPU memory containing the fetch shader subroutine.
     ========== ================= ===============================================================================

.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:

Per-Shader Table
################

Low 32 bits of the GPU address for an optional buffer in the ``.data``
section of the ELF. The high 32 bits of the address match the high 32 bits
of the shader's program counter.

The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.

Each shader's table in the ``.data`` section is referenced by the symbol
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
hardware shader stage the data is for. E.g.,
``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.

.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:

Spill Table
###########

It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory and use
one user data register to point to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32-bits of the table's GPU virtual
address.
The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the following
table. Note that features under development are not included
in this description.

    =================================== =======================================
    Core ISA                            ISA Extensions
    =================================== =======================================
    :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
    :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
    :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                        :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                        :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                        :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`

    :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                        :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
    =================================== =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.

Operands
~~~~~~~~

Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

Detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For the full list of supported instructions, refer to "Scalar Memory
Operations" in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For the full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For the full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the
operands. To force a specific encoding, one can add a suffix to the opcode of
the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions".

.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:

Code Object V2 Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.option.machine_version_major
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_minor
+++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for.
For example, when assembling for a "GFX810" target this will be set to the
integer value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_stepping
++++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.kernel.vgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus
one.

.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.

.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified a default value will be used. The
default value for all keys is 0, with the following exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that wavefront size is specified as a power of two, so a
  value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h and
test/CodeGen/AMDGPU/hsa.s.

.. _amdgpu-amdhsa-assembler-example-v2:

Code Object V2 Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
   :number-lines:

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

     .amd_kernel_code_t
       enable_sgpr_kernarg_segment_ptr = 1
       is_ptr64 = 1
       compute_pgm_rsrc1_vgprs = 0
       compute_pgm_rsrc1_sgprs = 0
       compute_pgm_rsrc2_user_sgpr = 2
       compute_pgm_rsrc1_wgp_mode = 0
       compute_pgm_rsrc1_mem_ordered = 0
       compute_pgm_rsrc1_fwd_progress = 1
     .end_amd_kernel_code_t

     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world

.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:

Code Object V3 and Above Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_minor
++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.
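
As a hedged illustration of how the generation symbols might be used, the
sketch below selects an instruction form at assembly time. It assumes that the
predefined symbols may appear in the absolute expressions accepted by the
generic ``.if``/``.elseif``/``.endif`` conditional directives; the register
numbers are arbitrary placeholders and are not part of any ABI:

.. code-block:: nasm

  .if .amdgcn.gfx_generation_number == 9
    v_add_u32 v1, v2, v3        ; GFX9 form: no carry-out operand
  .elseif .amdgcn.gfx_generation_number == 8
    v_add_u32 v1, vcc, v2, v3   ; GFX8 form: carry-out written to VCC
  .endif

Because the symbols do not affect code generation, a conditional like this only
chooses which source lines are assembled.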

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-directives-v3-onwards:

Code Object V3 and Above Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific.
Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.. _amdgpu-assembler-directive-amdgcn-target:

.amdgcn_target <target-triple> "-" <target-id>
++++++++++++++++++++++++++++++++++++++++++++++

Optional directive which declares the ``<target-triple>-<target-id>`` supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as ``-triple``, ``-mcpu``, and
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.

.. note::

  The target ID syntax used for code object V2 to V3 for this directive differs
  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive.
This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

  .. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== =================== ============ ===================
     Directive                                                Default             Supported On Description
     ======================================================== =================== ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX10   Controls KERNARG_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX10   Controls USER_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (wavefrontsize64)
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_enable_private_segment``                       0                   GFX940       Controls ENABLE_PRIVATE_SEGMENT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of the first AccVGPR in the unified register file. Used to calculate ACCUM_OFFSET in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                                                  GFX940
     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                  (except
                                                                                  GFX940)
     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                              Feature
                                                              Specific
                                                              (xnack)
     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                              Feature             GFX940
                                                              Specific
                                                              (tgsplit)
     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (cumode)
     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_shared_vgpr_count``                            0                   GFX10        Controls SHARED_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ======================================================== =================== ============ ===================

.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` or
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3-onwards:

Code Object V3 and Above Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   .text
   .globl hello_world
   .p2align 8
   .type hello_world,@function
   hello_world:
     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world

   .rodata
   .p2align 6
   .amdhsa_kernel hello_world
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   .amdgpu_metadata
   ---
   amdhsa.version:
     - 1
     - 0
   amdhsa.kernels:
     - .name: hello_world
       .symbol: hello_world.kd
       .kernarg_segment_size: 48
       .group_segment_fixed_size: 0
       .private_segment_fixed_size: 0
       .kernarg_segment_align: 4
       .wavefront_size: 64
       .sgpr_count: 2
       .vgpr_count: 3
       .max_flat_workgroup_size: 256
       .args:
         - .size: 8
           .offset: 0
           .value_kind: global_buffer
           .address_space: global
           .actual_access: write_only
   //...
   .end_amdgpu_metadata

This kernel is equivalent to the following HIP program:

.. code::
   :number-lines:

   __global__ void hello_world(float *p) {
       *p = 3.14159f;
   }

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive.
For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:

.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   // gpr tracking symbols are implicitly set to zero

   .text
   .globl kern0
   .p2align 8
   .type kern0,@function
   kern0:
     // ...
     s_endpgm
   .Lkern0_end:
     .size   kern0, .Lkern0_end-kern0

   .rodata
   .p2align 6
   .amdhsa_kernel kern0
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   // reset symbols to begin tracking usage in func1 and kern1
   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   .text
   .hidden func1
   .global func1
   .p2align 2
   .type func1,@function
   func1:
     // ...
     s_setpc_b64 s[30:31]
   .Lfunc1_end:
     .size func1, .Lfunc1_end-func1

   .globl kern1
   .p2align 8
   .type kern1,@function
   kern1:
     // ...
     s_getpc_b64 s[4:5]
     // low and high halves of the PC-relative address of func1
     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+12
     s_swappc_b64 s[30:31], s[4:5]
     // ...
     s_endpgm
   .Lkern1_end:
     .size   kern1, .Lkern1_end-kern1

   .rodata
   .p2align 6
   .amdhsa_kernel kern1
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

These symbols cannot identify connected components in order to automatically
track the usage for each kernel.
However, in some cases careful organization of
the kernels and functions in the source file means that little additional
effort is required to accurately calculate GPR usage.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__